[2026-03-25 15:35:42,837][mllm.models.large_language_model_local][INFO] - Initializing adapter 'agent_adapter': no initial weights provided or found; starting from scratch. [2026-03-25 15:35:43,580][mllm.models.adapter_training_wrapper][INFO] - Adapter 'agent_adapter': initialized with fresh weights (no initial weights found). [2026-03-25 15:35:43,587][mllm.models.large_language_model_local][INFO] - Initializing adapter 'critic_adapter': no initial weights provided or found; starting from scratch. [2026-03-25 15:35:44,279][mllm.models.adapter_training_wrapper][INFO] - Adapter 'critic_adapter': initialized with fresh weights (no initial weights found). [2026-03-25 15:38:00,821][__main__][INFO] - Starting iteration 0. [2026-03-25 15:38:00,837][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:38:00,837][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:38:05,004][__main__][INFO] - Number of regex retries in iteration 0: 0 [2026-03-25 15:38:05,005][__main__][INFO] - agents played in iteration 0 are Alice, Bob [2026-03-25 15:38:05,638][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:38:06,199][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:38:06,836][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:38:07,151][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:38:07,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:38:07,789][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:38:08,109][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:38:08,428][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:38:08,747][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:38:09,066][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:38:09,386][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:38:09,705][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:38:10,024][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:38:10,343][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:38:10,663][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:38:10,983][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:38:11,302][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:38:11,621][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:38:11,941][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:38:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:38:12,580][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:38:12,900][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:38:13,220][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:38:13,539][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:38:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:38:14,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:38:14,498][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:38:14,820][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:38:15,139][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:38:15,459][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:38:15,778][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:38:16,098][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:38:16,417][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:38:16,737][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:38:17,058][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:38:17,378][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:38:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:38:18,017][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:38:18,337][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:38:18,656][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:38:18,976][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:38:19,295][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:38:19,615][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:38:19,935][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:38:20,254][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:38:20,574][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:38:20,893][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:38:21,502][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:38:21,823][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:38:22,142][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:38:22,462][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:38:22,781][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:38:23,100][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:38:23,419][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:38:23,739][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:38:24,058][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:38:24,378][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:38:24,698][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:38:25,017][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:38:25,336][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:38:25,656][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:38:25,976][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:38:26,296][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:38:26,615][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:38:26,934][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:38:27,255][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:38:27,905][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 8.34%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 25.84%, ΔTime: 00:00:21 [2026-03-25 15:38:28,526][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:38:28,529][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:38:28,530][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:38:29,250][__main__][INFO] - Iteration 1 took 28s (14.67% Gen, 82.79% Train). Generation: 4s, Training: 23s. Estimated remaining time: 7h 50m 9s. Estimated total time: 7h 53m 34s. Time estimates for 10 more iterations: 4m 44s, 100 more iterations: 47m 21s, 500 more iterations: 3h 56m 47s. [2026-03-25 15:38:29,254][__main__][INFO] - Starting iteration 1. [2026-03-25 15:38:29,257][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:38:29,258][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:38:32,643][__main__][INFO] - Number of regex retries in iteration 1: 0 [2026-03-25 15:38:32,644][__main__][INFO] - agents played in iteration 1 are Alice, Bob [2026-03-25 15:38:33,301][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:38:33,947][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:38:34,239][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:38:34,561][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:38:34,881][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:38:35,200][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:38:35,520][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:38:35,839][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:38:36,159][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:38:36,479][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:38:36,798][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:38:37,119][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:38:37,439][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:38:37,759][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:38:38,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:38:38,400][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:38:38,720][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:38:39,039][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:38:39,358][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:38:39,678][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:38:39,998][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:38:40,319][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:38:40,639][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:38:40,960][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:38:41,280][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:38:41,600][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:38:41,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:38:42,238][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:38:42,558][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:38:42,879][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:38:43,198][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:38:43,518][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:38:43,838][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:38:44,158][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:38:44,477][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:38:44,797][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:38:45,117][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:38:45,436][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:38:45,758][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:38:46,078][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:38:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:38:46,718][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:38:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:38:47,359][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:38:47,678][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:38:47,998][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:38:48,318][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:38:48,638][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:38:48,957][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:38:49,277][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:38:49,597][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:38:49,918][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:38:50,238][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:38:50,888][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:38:51,208][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:38:51,528][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:38:51,847][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:38:52,167][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:38:52,487][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:38:52,806][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:38:53,125][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:38:53,446][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:38:53,766][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:38:54,085][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:38:54,405][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:38:54,724][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:38:55,365][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:38:56,064][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:38:56,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:38:56,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:38:56,729][__main__][INFO] - Iteration 2 took 27s (12.32% Gen, 85.26% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 34m 0s. Estimated total time: 7h 37m 53s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 47s, 500 more iterations: 3h 48m 56s. [2026-03-25 15:38:56,731][__main__][INFO] - Starting iteration 2. [2026-03-25 15:38:56,735][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:38:56,735][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:39:00,482][__main__][INFO] - Number of regex retries in iteration 2: 0 [2026-03-25 15:39:00,483][__main__][INFO] - agents played in iteration 2 are Alice, Bob [2026-03-25 15:39:01,063][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:39:01,741][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:39:02,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:39:02,353][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:39:02,673][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:39:02,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:39:03,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:39:03,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:39:03,951][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:39:04,271][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:39:04,590][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:39:04,911][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:39:05,230][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:39:05,550][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:39:05,869][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:39:06,189][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:39:06,508][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:39:06,827][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:39:07,148][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:39:07,467][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:39:07,788][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:39:08,107][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:39:08,427][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:39:08,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:39:09,069][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:39:09,389][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:39:09,709][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:39:10,029][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:39:10,350][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:39:10,669][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:39:10,989][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:39:11,310][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:39:11,629][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:39:11,948][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:39:12,268][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:39:12,589][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:39:12,909][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:39:13,229][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:39:13,549][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:39:13,868][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:39:14,188][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:39:14,506][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:39:14,826][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:39:15,145][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:39:15,465][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:39:15,785][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:39:16,104][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:39:16,423][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:39:16,743][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:39:17,064][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:39:17,384][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:39:17,705][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:39:18,025][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:39:18,663][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:39:18,983][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:39:19,304][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:39:19,624][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:39:19,943][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:39:20,263][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:39:20,583][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:39:20,905][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:39:21,225][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:39:21,545][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:39:21,865][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:39:22,185][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:39:22,505][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:39:23,147][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:39:23,846][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:39:23,849][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:39:23,850][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:39:24,513][__main__][INFO] - Iteration 3 took 27s (13.49% Gen, 84.12% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 38m 39s. Estimated total time: 7h 42m 59s. Time estimates for 10 more iterations: 4m 37s, 100 more iterations: 46m 17s, 500 more iterations: 3h 51m 29s. [2026-03-25 15:39:24,515][__main__][INFO] - Starting iteration 3. [2026-03-25 15:39:24,518][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:39:24,519][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:39:27,831][__main__][INFO] - Number of regex retries in iteration 3: 0 [2026-03-25 15:39:27,832][__main__][INFO] - agents played in iteration 3 are Alice, Bob [2026-03-25 15:39:28,472][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:39:29,155][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:39:29,446][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:39:29,767][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:39:30,088][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:39:30,409][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:39:30,729][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:39:31,048][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:39:31,368][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:39:31,689][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:39:32,009][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:39:32,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:39:32,649][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:39:32,970][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:39:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:39:33,612][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:39:33,935][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:39:34,256][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:39:34,577][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:39:34,897][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:39:35,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:39:35,538][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:39:35,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:39:36,180][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:39:36,500][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:39:36,821][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:39:37,141][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:39:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:39:37,781][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:39:38,101][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:39:38,421][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:39:38,740][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:39:39,060][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:39:39,380][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:39:39,700][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:39:40,020][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:39:40,341][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:39:40,661][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:39:40,981][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:39:41,300][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:39:41,620][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:39:41,940][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:39:42,261][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:39:42,581][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:39:42,900][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:39:43,222][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:39:43,541][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:39:43,861][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:39:44,181][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:39:44,502][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:39:44,822][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:39:45,141][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:39:45,461][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:39:46,075][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:39:46,395][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:39:46,716][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:39:47,036][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:39:47,356][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:39:47,676][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:39:47,997][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:39:48,317][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:39:48,638][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:39:48,958][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:39:49,279][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:39:49,600][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:39:49,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:39:50,568][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:39:51,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:39:51,270][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:39:51,271][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:39:51,933][__main__][INFO] - Iteration 4 took 27s (12.08% Gen, 85.49% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 32m 8s. Estimated total time: 7h 36m 56s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 28s. [2026-03-25 15:39:51,935][__main__][INFO] - Starting iteration 4. [2026-03-25 15:39:51,939][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:39:51,940][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:39:55,413][__main__][INFO] - Number of regex retries in iteration 4: 0 [2026-03-25 15:39:55,414][__main__][INFO] - agents played in iteration 4 are Alice, Bob [2026-03-25 15:39:56,094][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:39:56,746][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:39:57,038][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:39:57,360][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:39:57,680][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:39:58,001][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:39:58,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:39:58,641][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:39:58,960][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:39:59,280][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:39:59,600][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:39:59,920][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:40:00,240][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:40:00,561][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:40:00,880][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:40:01,201][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:40:01,521][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:40:01,840][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:40:02,161][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:40:02,482][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:40:02,802][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:40:03,123][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:40:03,443][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:40:03,763][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:40:04,083][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:40:04,403][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:40:04,723][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:40:05,045][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:40:05,365][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:40:05,685][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:40:06,005][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:40:06,325][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:40:06,645][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:40:06,965][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:40:07,285][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:40:07,605][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:40:07,924][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:40:08,245][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:40:08,566][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:40:08,886][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:40:09,206][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:40:09,526][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:40:09,846][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:40:10,167][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:40:10,487][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:40:10,807][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:40:11,128][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:40:11,448][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:40:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:40:12,089][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:40:12,409][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:40:12,729][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:40:13,049][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:40:13,686][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:40:14,009][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:40:14,329][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:40:14,650][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:40:14,969][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:40:15,290][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:40:15,610][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:40:15,930][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:40:16,250][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:40:16,570][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:40:16,889][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:40:17,210][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:40:17,530][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:40:18,174][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:40:18,876][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:40:18,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:40:18,879][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:40:19,543][__main__][INFO] - Iteration 5 took 27s (12.59% Gen, 85.00% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 34m 50s. Estimated total time: 7h 40m 5s. Time estimates for 10 more iterations: 4m 36s, 100 more iterations: 46m 0s, 500 more iterations: 3h 50m 2s. [2026-03-25 15:40:19,546][__main__][INFO] - Starting iteration 5. [2026-03-25 15:40:19,549][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:40:19,549][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:40:22,977][__main__][INFO] - Number of regex retries in iteration 5: 0 [2026-03-25 15:40:22,978][__main__][INFO] - agents played in iteration 5 are Alice, Bob [2026-03-25 15:40:23,592][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:40:24,244][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:40:24,535][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:40:24,856][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:40:25,177][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:40:25,497][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:40:25,817][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:40:26,137][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:40:26,456][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:40:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:40:27,096][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:40:27,416][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:40:27,735][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:40:28,055][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:40:28,375][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:40:28,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:40:29,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:40:29,333][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:40:29,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:40:29,973][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:40:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:40:30,612][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:40:30,932][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:40:31,251][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:40:31,571][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:40:31,890][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:40:32,209][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:40:32,529][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:40:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:40:33,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:40:33,490][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:40:33,809][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:40:34,128][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:40:34,448][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:40:34,767][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:40:35,087][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:40:35,407][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:40:35,726][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:40:36,046][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:40:36,365][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:40:36,685][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:40:37,005][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:40:37,324][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:40:37,644][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:40:37,963][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:40:38,283][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:40:38,603][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:40:38,922][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:40:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:40:39,562][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:40:39,881][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:40:40,201][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:40:40,522][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:40:41,159][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:40:41,479][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:40:41,799][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:40:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:40:42,438][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:40:42,758][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:40:43,079][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:40:43,398][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:40:43,718][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:40:44,038][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:40:44,358][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:40:44,678][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:40:44,998][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:40:45,696][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:40:46,394][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:40:46,397][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:40:46,398][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:40:47,058][__main__][INFO] - Iteration 6 took 27s (12.46% Gen, 85.13% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 32m 48s. Estimated total time: 7h 38m 30s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 51s, 500 more iterations: 3h 49m 15s. [2026-03-25 15:40:47,060][__main__][INFO] - Starting iteration 6. [2026-03-25 15:40:47,063][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:40:47,064][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:40:50,358][__main__][INFO] - Number of regex retries in iteration 6: 0 [2026-03-25 15:40:50,359][__main__][INFO] - agents played in iteration 6 are Alice, Bob [2026-03-25 15:40:51,032][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:40:51,672][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:40:52,416][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:40:52,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:40:53,058][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:40:53,378][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:40:53,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:40:54,021][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:40:54,341][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:40:54,662][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:40:54,983][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:40:55,303][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:40:55,624][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:40:55,944][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:40:56,264][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:40:56,584][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:40:56,905][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:40:57,225][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:40:57,546][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:40:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:40:58,186][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:40:58,507][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:40:58,826][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:40:59,147][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:40:59,468][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:40:59,788][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:41:00,107][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:41:00,428][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:41:00,749][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:41:01,070][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:41:01,390][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:41:01,709][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:41:02,029][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:41:02,348][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:41:02,668][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:41:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:41:03,309][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:41:03,629][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:41:03,948][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:41:04,269][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:41:04,588][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:41:04,907][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:41:05,227][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:41:05,548][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:41:05,869][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:41:06,190][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:41:06,511][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:41:06,830][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:41:07,149][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:41:07,468][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:41:07,787][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:41:08,108][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:41:08,427][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:41:09,034][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:41:09,354][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:41:09,674][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:41:09,995][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:41:10,314][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:41:10,635][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:41:10,956][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:41:11,276][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:41:11,596][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:41:11,916][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:41:12,237][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:41:12,557][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:41:12,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:41:13,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:41:14,213][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:41:14,215][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:41:14,217][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:41:14,874][__main__][INFO] - Iteration 7 took 27s (11.85% Gen, 85.78% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 37m 20s. Estimated total time: 7h 43m 31s. Time estimates for 10 more iterations: 4m 38s, 100 more iterations: 46m 21s, 500 more iterations: 3h 51m 45s. [2026-03-25 15:41:14,876][__main__][INFO] - Starting iteration 7. [2026-03-25 15:41:14,879][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:41:14,880][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:41:18,289][__main__][INFO] - Number of regex retries in iteration 7: 0 [2026-03-25 15:41:18,289][__main__][INFO] - agents played in iteration 7 are Alice, Bob [2026-03-25 15:41:18,924][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:41:19,569][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:41:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:41:20,181][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:41:20,501][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:41:20,821][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:41:21,141][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:41:21,460][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:41:21,780][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:41:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:41:22,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:41:22,740][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:41:23,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:41:23,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:41:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:41:24,024][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:41:24,344][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:41:24,664][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:41:24,984][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:41:25,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:41:25,624][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:41:25,943][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:41:26,263][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:41:26,582][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:41:26,903][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:41:27,225][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:41:27,544][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:41:27,864][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:41:28,184][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:41:28,505][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:41:28,824][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:41:29,144][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:41:29,464][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:41:29,784][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:41:30,104][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:41:30,424][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:41:30,744][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:41:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:41:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:41:31,703][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:41:32,022][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:41:32,341][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:41:32,661][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:41:32,982][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:41:33,302][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:41:33,621][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:41:33,940][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:41:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:41:34,579][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:41:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:41:35,218][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:41:35,537][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:41:35,858][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:41:36,466][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:41:36,786][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:41:37,105][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:41:37,425][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:41:37,745][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:41:38,067][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:41:38,387][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:41:38,707][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:41:39,026][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:41:39,346][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:41:39,666][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:41:39,986][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:41:40,307][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:41:40,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:41:41,643][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:41:41,646][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:41:41,647][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:41:42,301][__main__][INFO] - Iteration 8 took 27s (12.43% Gen, 85.17% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 30m 25s. Estimated total time: 7h 37m 3s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 42s, 500 more iterations: 3h 48m 31s. [2026-03-25 15:41:42,304][__main__][INFO] - Starting iteration 8. [2026-03-25 15:41:42,307][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:41:42,307][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:41:45,645][__main__][INFO] - Number of regex retries in iteration 8: 0 [2026-03-25 15:41:45,646][__main__][INFO] - agents played in iteration 8 are Alice, Bob [2026-03-25 15:41:46,335][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:41:46,980][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:41:47,271][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:41:47,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:41:47,912][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:41:48,234][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:41:48,555][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:41:48,875][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:41:49,196][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:41:49,516][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:41:49,837][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:41:50,158][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:41:50,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:41:50,800][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:41:51,120][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:41:51,441][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:41:51,762][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:41:52,083][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:41:52,404][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:41:52,724][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:41:53,044][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:41:53,364][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:41:53,684][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:41:54,003][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:41:54,324][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:41:54,644][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:41:54,966][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:41:55,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:41:55,607][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:41:55,927][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:41:56,247][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:41:56,567][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:41:56,887][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:41:57,209][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:41:57,529][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:41:57,849][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:41:58,169][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:41:58,490][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:41:58,810][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:41:59,130][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:41:59,450][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:41:59,769][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:42:00,090][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:42:00,410][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:42:00,731][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:42:01,051][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:42:01,371][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:42:01,691][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:42:02,012][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:42:02,332][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:42:02,652][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:42:02,972][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:42:03,293][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:42:03,900][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:42:04,220][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:42:04,540][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:42:04,860][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:42:05,181][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:42:05,501][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:42:05,821][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:42:06,141][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:42:06,462][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:42:06,783][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:42:07,105][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:42:07,426][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:42:07,746][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:42:08,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:42:09,094][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:42:09,096][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:42:09,098][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:42:09,746][__main__][INFO] - Iteration 9 took 27s (12.17% Gen, 85.47% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 30m 15s. Estimated total time: 7h 37m 20s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 40s. [2026-03-25 15:42:09,748][__main__][INFO] - Starting iteration 9. [2026-03-25 15:42:09,751][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:42:09,752][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:42:13,121][__main__][INFO] - Number of regex retries in iteration 9: 0 [2026-03-25 15:42:13,122][__main__][INFO] - agents played in iteration 9 are Alice, Bob [2026-03-25 15:42:13,680][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:42:14,336][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:42:14,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:42:14,952][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:42:15,273][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:42:15,593][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:42:15,914][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:42:16,234][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:42:16,555][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:42:16,878][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:42:17,200][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:42:17,521][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:42:17,842][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:42:18,162][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:42:18,483][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:42:18,803][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:42:19,125][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:42:19,446][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:42:19,767][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:42:20,086][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:42:20,405][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:42:20,725][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:42:21,045][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:42:21,365][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:42:21,685][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:42:22,006][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:42:22,325][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:42:22,646][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:42:22,966][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:42:23,286][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:42:23,605][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:42:23,925][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:42:24,245][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:42:24,565][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:42:24,885][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:42:25,205][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:42:25,524][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:42:25,844][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:42:26,165][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:42:26,485][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:42:26,804][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:42:27,126][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:42:27,446][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:42:27,766][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:42:28,086][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:42:28,406][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:42:28,726][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:42:29,047][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:42:29,367][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:42:29,687][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:42:30,007][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:42:30,326][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:42:30,647][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:42:31,260][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:42:31,580][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:42:31,900][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:42:32,219][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:42:32,540][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:42:32,860][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:42:33,181][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:42:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:42:33,820][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:42:34,139][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:42:34,459][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:42:34,778][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:42:35,099][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:42:35,757][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:42:36,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:42:36,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:42:36,488][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:42:37,143][__main__][INFO] - Iteration 10 took 27s (12.30% Gen, 85.30% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 28m 59s. Estimated total time: 7h 36m 32s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 16s. [2026-03-25 15:42:37,145][__main__][INFO] - Starting iteration 10. [2026-03-25 15:42:37,149][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:42:37,149][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:42:40,557][__main__][INFO] - Number of regex retries in iteration 10: 0 [2026-03-25 15:42:40,558][__main__][INFO] - agents played in iteration 10 are Alice, Bob [2026-03-25 15:42:41,113][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:42:41,771][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:42:42,133][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:42:42,454][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:42:42,774][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:42:43,094][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:42:43,414][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:42:43,734][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:42:44,054][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:42:44,374][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:42:44,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:42:45,015][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:42:45,336][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:42:45,657][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:42:45,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:42:46,298][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:42:46,619][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:42:46,940][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:42:47,261][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:42:47,582][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:42:47,903][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:42:48,223][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:42:48,543][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:42:48,864][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:42:49,184][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:42:49,503][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:42:49,824][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:42:50,144][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:42:50,463][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:42:50,784][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:42:51,103][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:42:51,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:42:51,743][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:42:52,063][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:42:52,382][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:42:52,703][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:42:53,023][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:42:53,343][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:42:53,663][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:42:53,984][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:42:54,303][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:42:54,623][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:42:54,942][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:42:55,263][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:42:55,584][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:42:55,903][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:42:56,222][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:42:56,543][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:42:56,863][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:42:57,183][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:42:57,504][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:42:57,823][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:42:58,143][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:42:58,758][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:42:59,079][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:42:59,398][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:42:59,719][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:43:00,039][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:43:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:43:00,680][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:43:01,001][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:43:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:43:01,641][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:43:01,960][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:43:02,280][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:43:02,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:43:03,252][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:43:03,954][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:43:03,956][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:43:03,958][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:43:04,623][__main__][INFO] - Iteration 11 took 27s (12.41% Gen, 85.17% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 29m 55s. Estimated total time: 7h 37m 55s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 47s, 500 more iterations: 3h 48m 57s. [2026-03-25 15:43:04,626][__main__][INFO] - Starting iteration 11. [2026-03-25 15:43:04,629][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:43:04,629][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:43:08,725][__main__][INFO] - Number of regex retries in iteration 11: 0 [2026-03-25 15:43:08,726][__main__][INFO] - agents played in iteration 11 are Alice, Bob [2026-03-25 15:43:09,267][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:43:09,931][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:43:10,223][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:43:10,544][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:43:10,865][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:43:11,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:43:11,505][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:43:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:43:12,144][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:43:12,465][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:43:12,785][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:43:13,106][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:43:13,426][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:43:13,746][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:43:14,067][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:43:14,387][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:43:14,707][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:43:15,026][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:43:15,346][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:43:15,666][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:43:15,985][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:43:16,305][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:43:16,626][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:43:16,946][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:43:17,266][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:43:17,585][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:43:17,906][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:43:18,225][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:43:18,545][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:43:18,866][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:43:19,185][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:43:19,505][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:43:19,825][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:43:20,145][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:43:20,465][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:43:20,786][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:43:21,105][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:43:21,424][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:43:21,744][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:43:22,064][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:43:22,384][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:43:22,705][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:43:23,026][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:43:23,346][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:43:23,666][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:43:23,985][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:43:24,304][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:43:24,624][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:43:24,944][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:43:25,263][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:43:25,582][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:43:25,902][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:43:26,221][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:43:26,842][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:43:27,164][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:43:27,484][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:43:27,803][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:43:28,123][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:43:28,443][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:43:28,765][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:43:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:43:29,404][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:43:29,724][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:43:30,044][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:43:30,363][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:43:30,683][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:43:31,326][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:43:32,027][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:43:32,029][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:43:32,031][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:43:32,690][__main__][INFO] - Iteration 12 took 28s (14.60% Gen, 83.05% Train). Generation: 4s, Training: 23s. Estimated remaining time: 7h 39m 14s. Estimated total time: 7h 47m 42s. Time estimates for 10 more iterations: 4m 40s, 100 more iterations: 46m 46s, 500 more iterations: 3h 53m 51s. [2026-03-25 15:43:32,692][__main__][INFO] - Starting iteration 12. [2026-03-25 15:43:32,695][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:43:32,696][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:43:36,054][__main__][INFO] - Number of regex retries in iteration 12: 0 [2026-03-25 15:43:36,055][__main__][INFO] - agents played in iteration 12 are Alice, Bob [2026-03-25 15:43:36,597][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:43:37,241][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:43:37,533][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:43:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:43:38,173][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:43:38,493][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:43:38,814][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:43:39,133][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:43:39,454][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:43:39,773][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:43:40,093][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:43:40,413][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:43:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:43:41,054][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:43:41,374][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:43:41,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:43:42,013][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:43:42,335][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:43:42,655][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:43:42,975][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:43:43,294][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:43:43,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:43:43,934][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:43:44,255][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:43:44,574][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:43:44,895][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:43:45,215][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:43:45,535][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:43:45,855][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:43:46,174][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:43:46,495][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:43:46,815][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:43:47,134][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:43:47,454][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:43:47,774][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:43:48,094][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:43:48,413][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:43:48,733][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:43:49,054][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:43:49,373][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:43:49,693][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:43:50,012][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:43:50,332][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:43:50,652][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:43:50,972][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:43:51,291][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:43:51,612][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:43:51,932][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:43:52,252][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:43:52,571][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:43:52,891][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:43:53,210][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:43:53,529][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:43:54,135][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:43:54,455][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:43:54,774][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:43:55,094][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:43:55,413][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:43:55,733][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:43:56,053][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:43:56,372][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:43:56,692][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:43:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:43:57,332][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:43:57,652][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:43:57,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:43:58,608][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:43:59,311][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:43:59,313][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:43:59,315][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:43:59,974][__main__][INFO] - Iteration 13 took 27s (12.31% Gen, 85.26% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 25m 44s. Estimated total time: 7h 34m 40s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 20s. [2026-03-25 15:43:59,976][__main__][INFO] - Starting iteration 13. [2026-03-25 15:43:59,980][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:43:59,980][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:44:03,281][__main__][INFO] - Number of regex retries in iteration 13: 0 [2026-03-25 15:44:03,282][__main__][INFO] - agents played in iteration 13 are Alice, Bob [2026-03-25 15:44:03,824][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:44:04,468][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:44:04,761][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:44:05,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:44:05,402][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:44:05,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:44:06,043][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:44:06,363][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:44:06,682][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:44:07,002][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:44:07,322][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:44:07,642][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:44:07,962][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:44:08,282][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:44:08,602][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:44:08,922][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:44:09,241][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:44:09,562][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:44:09,882][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:44:10,202][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:44:10,522][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:44:10,842][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:44:11,162][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:44:11,482][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:44:11,803][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:44:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:44:12,443][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:44:12,763][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:44:13,083][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:44:13,402][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:44:13,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:44:14,044][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:44:14,364][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:44:14,684][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:44:15,003][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:44:15,323][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:44:15,643][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:44:15,964][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:44:16,284][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:44:16,604][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:44:16,924][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:44:17,244][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:44:17,565][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:44:17,884][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:44:18,204][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:44:18,524][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:44:18,844][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:44:19,163][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:44:19,484][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:44:19,804][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:44:20,124][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:44:20,444][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:44:20,764][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:44:21,371][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:44:21,691][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:44:22,010][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:44:22,331][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:44:22,652][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:44:22,972][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:44:23,292][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:44:23,612][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:44:23,933][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:44:24,252][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:44:24,572][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:44:24,893][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:44:25,213][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:44:25,850][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:44:26,551][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:44:26,553][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:44:26,555][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:44:27,202][__main__][INFO] - Iteration 14 took 27s (12.13% Gen, 85.49% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 24m 20s. Estimated total time: 7h 33m 43s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 22s, 500 more iterations: 3h 46m 51s. [2026-03-25 15:44:27,204][__main__][INFO] - Starting iteration 14. [2026-03-25 15:44:27,207][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:44:27,208][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:44:30,483][__main__][INFO] - Number of regex retries in iteration 14: 0 [2026-03-25 15:44:30,484][__main__][INFO] - agents played in iteration 14 are Alice, Bob [2026-03-25 15:44:31,050][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:44:31,698][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:44:31,992][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:44:32,313][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:44:32,633][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:44:32,953][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:44:33,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:44:33,594][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:44:33,915][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:44:34,236][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:44:34,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:44:34,877][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:44:35,197][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:44:35,517][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:44:35,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:44:36,157][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:44:36,477][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:44:36,798][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:44:37,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:44:37,439][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:44:37,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:44:38,080][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:44:38,400][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:44:38,721][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:44:39,041][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:44:39,361][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:44:39,682][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:44:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:44:40,322][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:44:40,643][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:44:40,963][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:44:41,285][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:44:41,606][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:44:41,926][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:44:42,248][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:44:42,568][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:44:42,889][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:44:43,209][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:44:43,530][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:44:43,850][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:44:44,171][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:44:44,491][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:44:44,811][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:44:45,131][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:44:45,451][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:44:45,771][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:44:46,092][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:44:46,411][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:44:46,732][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:44:47,051][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:44:47,371][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:44:47,690][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:44:48,011][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:44:48,620][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:44:48,941][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:44:49,261][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:44:49,582][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:44:49,902][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:44:50,221][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:44:50,542][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:44:50,862][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:44:51,181][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:44:51,502][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:44:51,822][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:44:52,143][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:44:52,464][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:44:53,106][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:44:53,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:44:53,816][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:44:53,818][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:44:54,472][__main__][INFO] - Iteration 15 took 27s (12.01% Gen, 85.58% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 24m 35s. Estimated total time: 7h 34m 25s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 12s. [2026-03-25 15:44:54,474][__main__][INFO] - Starting iteration 15. [2026-03-25 15:44:54,477][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:44:54,478][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:44:57,823][__main__][INFO] - Number of regex retries in iteration 15: 0 [2026-03-25 15:44:57,824][__main__][INFO] - agents played in iteration 15 are Alice, Bob [2026-03-25 15:44:58,411][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:44:59,080][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:44:59,373][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:44:59,694][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:45:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:45:00,335][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:45:00,655][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:45:00,975][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:45:01,296][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:45:01,616][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:45:01,936][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:45:02,257][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:45:02,578][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:45:02,898][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:45:03,219][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:45:03,541][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:45:03,862][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:45:04,182][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:45:04,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:45:04,823][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:45:05,144][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:45:05,464][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:45:05,785][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:45:06,105][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:45:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:45:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:45:07,065][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:45:07,384][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:45:07,705][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:45:08,025][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:45:08,345][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:45:08,666][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:45:08,987][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:45:09,307][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:45:09,626][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:45:09,947][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:45:10,267][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:45:10,587][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:45:10,907][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:45:11,227][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:45:11,548][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:45:11,868][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:45:12,189][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:45:12,509][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:45:12,829][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:45:13,149][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:45:13,469][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:45:13,789][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:45:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:45:14,430][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:45:14,749][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:45:15,070][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:45:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:45:16,009][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:45:16,330][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:45:16,650][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:45:16,971][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:45:17,291][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:45:17,611][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:45:17,932][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:45:18,252][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:45:18,572][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:45:18,892][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:45:19,212][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:45:19,532][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:45:19,853][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:45:20,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:45:21,218][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:45:21,220][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:45:21,221][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:45:21,871][__main__][INFO] - Iteration 16 took 27s (12.21% Gen, 85.41% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 26m 17s. Estimated total time: 7h 36m 34s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 17s. [2026-03-25 15:45:21,873][__main__][INFO] - Starting iteration 16. [2026-03-25 15:45:21,876][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:45:21,877][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:45:25,224][__main__][INFO] - Number of regex retries in iteration 16: 0 [2026-03-25 15:45:25,225][__main__][INFO] - agents played in iteration 16 are Alice, Bob [2026-03-25 15:45:25,817][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:45:26,483][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:45:26,775][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:45:27,098][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:45:27,420][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:45:27,740][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:45:28,061][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:45:28,381][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:45:28,702][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:45:29,022][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:45:29,342][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:45:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:45:29,981][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:45:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:45:30,621][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:45:30,941][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:45:31,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:45:31,581][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:45:31,901][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:45:32,221][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:45:32,541][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:45:32,861][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:45:33,182][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:45:33,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:45:33,822][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:45:34,142][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:45:34,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:45:34,781][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:45:35,101][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:45:35,422][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:45:35,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:45:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:45:36,385][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:45:36,705][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:45:37,026][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:45:37,348][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:45:37,667][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:45:37,987][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:45:38,309][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:45:38,629][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:45:38,949][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:45:39,268][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:45:39,588][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:45:39,908][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:45:40,228][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:45:40,548][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:45:40,868][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:45:41,188][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:45:41,509][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:45:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:45:42,149][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:45:42,469][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:45:42,789][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:45:43,408][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:45:43,728][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:45:44,048][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:45:44,368][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:45:44,688][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:45:45,008][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:45:45,328][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:45:45,648][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:45:45,968][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:45:46,288][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:45:46,609][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:45:46,930][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:45:47,249][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:45:47,910][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:45:48,616][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:45:48,618][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:45:48,620][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:45:49,274][__main__][INFO] - Iteration 17 took 27s (12.22% Gen, 85.39% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 25m 53s. Estimated total time: 7h 36m 38s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 19s. [2026-03-25 15:45:49,276][__main__][INFO] - Starting iteration 17. [2026-03-25 15:45:49,279][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:45:49,280][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:45:52,673][__main__][INFO] - Number of regex retries in iteration 17: 0 [2026-03-25 15:45:52,674][__main__][INFO] - agents played in iteration 17 are Alice, Bob [2026-03-25 15:45:53,254][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:45:53,919][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:45:54,210][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:45:54,532][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:45:54,852][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:45:55,173][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:45:55,493][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:45:55,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:45:56,134][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:45:56,454][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:45:56,775][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:45:57,097][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:45:57,418][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:45:57,739][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:45:58,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:45:58,380][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:45:58,701][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:45:59,021][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:45:59,341][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:45:59,661][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:45:59,984][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:46:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:46:00,626][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:46:00,948][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:46:01,269][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:46:01,590][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:46:01,909][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:46:02,230][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:46:02,551][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:46:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:46:03,191][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:46:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:46:03,831][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:46:04,151][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:46:04,472][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:46:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:46:05,115][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:46:05,435][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:46:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:46:06,075][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:46:06,395][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:46:06,716][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:46:07,036][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:46:07,357][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:46:07,678][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:46:07,998][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:46:08,319][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:46:08,640][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:46:08,961][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:46:09,282][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:46:09,602][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:46:09,923][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:46:10,243][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:46:10,862][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:46:11,183][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:46:11,504][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:46:11,825][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:46:12,146][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:46:12,467][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:46:12,787][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:46:13,108][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:46:13,429][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:46:13,752][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:46:14,073][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:46:14,392][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:46:14,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:46:15,373][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:46:16,089][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:46:16,091][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:46:16,093][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:46:16,703][__main__][INFO] - Iteration 18 took 27s (12.38% Gen, 85.39% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 25m 52s. Estimated total time: 7h 37m 4s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 42s, 500 more iterations: 3h 48m 32s. [2026-03-25 15:46:16,705][__main__][INFO] - Starting iteration 18. [2026-03-25 15:46:16,708][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:46:16,709][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:46:20,037][__main__][INFO] - Number of regex retries in iteration 18: 0 [2026-03-25 15:46:20,038][__main__][INFO] - agents played in iteration 18 are Alice, Bob [2026-03-25 15:46:20,623][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:46:21,279][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:46:21,571][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:46:21,892][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:46:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:46:22,534][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:46:22,854][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:46:23,174][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:46:23,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:46:23,814][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:46:24,134][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:46:24,454][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:46:24,775][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:46:25,096][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:46:25,415][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:46:25,736][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:46:26,055][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:46:26,377][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:46:26,698][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:46:27,019][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:46:27,340][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:46:27,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:46:27,983][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:46:28,303][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:46:28,622][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:46:28,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:46:29,262][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:46:29,582][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:46:29,903][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:46:30,224][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:46:30,544][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:46:30,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:46:31,184][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:46:31,504][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:46:31,824][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:46:32,145][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:46:32,466][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:46:32,787][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:46:33,107][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:46:33,428][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:46:33,749][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:46:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:46:34,390][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:46:34,710][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:46:35,031][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:46:35,351][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:46:35,673][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:46:35,994][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:46:36,314][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:46:36,633][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:46:36,953][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:46:37,274][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:46:37,594][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:46:38,207][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:46:38,528][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:46:38,849][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:46:39,169][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:46:39,490][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:46:39,812][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:46:40,133][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:46:40,454][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:46:40,774][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:46:41,094][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:46:41,415][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:46:41,735][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:46:42,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:46:42,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:46:43,423][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:46:43,426][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:46:43,428][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:46:44,070][__main__][INFO] - Iteration 19 took 27s (12.17% Gen, 85.48% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 24m 22s. Estimated total time: 7h 36m 2s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 36s, 500 more iterations: 3h 48m 1s. [2026-03-25 15:46:44,072][__main__][INFO] - Starting iteration 19. [2026-03-25 15:46:44,075][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:46:44,075][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:46:47,434][__main__][INFO] - Number of regex retries in iteration 19: 0 [2026-03-25 15:46:47,435][__main__][INFO] - agents played in iteration 19 are Alice, Bob [2026-03-25 15:46:48,039][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:46:48,699][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:46:48,990][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:46:49,312][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:46:49,632][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:46:49,953][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:46:50,274][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:46:50,594][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:46:50,915][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:46:51,236][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:46:51,557][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:46:51,878][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:46:52,200][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:46:52,521][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:46:52,841][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:46:53,162][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:46:53,482][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:46:53,804][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:46:54,126][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:46:54,446][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:46:54,767][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:46:55,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:46:55,408][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:46:55,729][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:46:56,050][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:46:56,370][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:46:56,690][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:46:57,010][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:46:57,331][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:46:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:46:57,973][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:46:58,293][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:46:58,614][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:46:58,935][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:46:59,255][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:46:59,577][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:46:59,897][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:47:00,218][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:47:00,538][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:47:00,859][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:47:01,182][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:47:01,502][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:47:01,823][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:47:02,144][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:47:02,465][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:47:02,785][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:47:03,106][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:47:03,426][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:47:03,747][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:47:04,067][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:47:04,388][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:47:04,709][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:47:05,030][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:47:05,649][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:47:05,970][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:47:06,290][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:47:06,611][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:47:06,931][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:47:07,252][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:47:07,573][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:47:07,894][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:47:08,214][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:47:08,535][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:47:08,856][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:47:09,177][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:47:09,498][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:47:10,151][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:47:10,865][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:47:10,867][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:47:10,868][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:47:11,506][__main__][INFO] - Iteration 20 took 27s (12.25% Gen, 85.42% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 25m 5s. Estimated total time: 7h 37m 12s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 43s, 500 more iterations: 3h 48m 36s. [2026-03-25 15:47:11,508][__main__][INFO] - Starting iteration 20. [2026-03-25 15:47:11,511][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:47:11,512][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:47:14,841][__main__][INFO] - Number of regex retries in iteration 20: 0 [2026-03-25 15:47:14,842][__main__][INFO] - agents played in iteration 20 are Alice, Bob [2026-03-25 15:47:15,444][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:47:16,095][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:47:16,387][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:47:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:47:17,028][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:47:17,349][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:47:17,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:47:17,989][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:47:18,310][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:47:18,631][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:47:18,952][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:47:19,273][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:47:19,594][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:47:19,914][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:47:20,236][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:47:20,557][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:47:20,878][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:47:21,200][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:47:21,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:47:21,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:47:22,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:47:22,485][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:47:22,805][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:47:23,125][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:47:23,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:47:23,767][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:47:24,088][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:47:24,409][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:47:24,730][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:47:25,050][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:47:25,371][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:47:25,692][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:47:26,013][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:47:26,334][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:47:26,655][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:47:26,975][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:47:27,297][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:47:27,617][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:47:27,938][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:47:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:47:28,580][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:47:28,900][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:47:29,221][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:47:29,541][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:47:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:47:30,182][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:47:30,503][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:47:30,823][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:47:31,144][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:47:31,465][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:47:31,787][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:47:32,108][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:47:32,429][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:47:33,044][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:47:33,367][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:47:33,688][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:47:34,011][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:47:34,333][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:47:34,653][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:47:34,975][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:47:35,296][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:47:35,617][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:47:35,937][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:47:36,258][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:47:36,579][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:47:36,901][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:47:37,547][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:47:38,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:47:38,264][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:47:38,265][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:47:38,905][__main__][INFO] - Iteration 21 took 27s (12.15% Gen, 85.50% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 24m 0s. Estimated total time: 7h 36m 34s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 17s. [2026-03-25 15:47:38,907][__main__][INFO] - Starting iteration 21. [2026-03-25 15:47:38,910][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:47:38,911][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:47:42,248][__main__][INFO] - Number of regex retries in iteration 21: 0 [2026-03-25 15:47:42,248][__main__][INFO] - agents played in iteration 21 are Alice, Bob [2026-03-25 15:47:42,845][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:47:43,504][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:47:43,795][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:47:44,116][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:47:44,437][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:47:44,758][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:47:45,081][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:47:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:47:45,723][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:47:46,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:47:46,365][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:47:46,686][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:47:47,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:47:47,328][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:47:47,649][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:47:47,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:47:48,291][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:47:48,612][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:47:48,933][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:47:49,254][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:47:49,575][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:47:49,895][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:47:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:47:50,536][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:47:50,857][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:47:51,179][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:47:51,499][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:47:51,819][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:47:52,140][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:47:52,461][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:47:52,782][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:47:53,102][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:47:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:47:53,742][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:47:54,063][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:47:54,383][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:47:54,704][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:47:55,024][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:47:55,345][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:47:55,666][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:47:55,987][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:47:56,310][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:47:56,630][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:47:56,951][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:47:57,272][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:47:57,593][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:47:57,914][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:47:58,234][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:47:58,554][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:47:58,874][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:47:59,196][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:47:59,517][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:47:59,837][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:48:00,451][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:48:00,772][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:48:01,093][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:48:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:48:01,733][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:48:02,054][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:48:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:48:02,697][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:48:03,018][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:48:03,340][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:48:03,661][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:48:03,982][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:48:04,303][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:48:04,956][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:48:05,667][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:48:05,669][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:48:05,671][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:48:06,308][__main__][INFO] - Iteration 22 took 27s (12.18% Gen, 85.48% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 23m 37s. Estimated total time: 7h 36m 39s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 19s. [2026-03-25 15:48:06,310][__main__][INFO] - Starting iteration 22. [2026-03-25 15:48:06,314][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:48:06,314][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:48:09,655][__main__][INFO] - Number of regex retries in iteration 22: 0 [2026-03-25 15:48:09,656][__main__][INFO] - agents played in iteration 22 are Alice, Bob [2026-03-25 15:48:10,247][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:48:10,911][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:48:11,202][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:48:11,523][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:48:11,844][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:48:12,165][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:48:12,484][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:48:12,804][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:48:13,124][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:48:13,444][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:48:13,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:48:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:48:14,406][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:48:14,727][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:48:15,048][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:48:15,369][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:48:15,690][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:48:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:48:16,330][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:48:16,651][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:48:16,972][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:48:17,293][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:48:17,613][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:48:17,934][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:48:18,254][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:48:18,576][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:48:18,896][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:48:19,217][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:48:19,539][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:48:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:48:20,183][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:48:20,504][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:48:20,823][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:48:21,144][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:48:21,464][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:48:21,785][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:48:22,105][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:48:22,426][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:48:22,747][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:48:23,067][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:48:23,388][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:48:23,709][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:48:24,029][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:48:24,349][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:48:24,670][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:48:24,990][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:48:25,311][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:48:25,632][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:48:25,953][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:48:26,274][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:48:26,595][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:48:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:48:27,237][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:48:27,852][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:48:28,174][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:48:28,494][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:48:28,815][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:48:29,135][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:48:29,457][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:48:29,778][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:48:30,099][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:48:30,420][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:48:30,741][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:48:31,062][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:48:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:48:31,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:48:32,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:48:33,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:48:33,068][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:48:33,069][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:48:33,709][__main__][INFO] - Iteration 23 took 27s (12.20% Gen, 85.46% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 23m 7s. Estimated total time: 7h 36m 36s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 18s. [2026-03-25 15:48:33,712][__main__][INFO] - Starting iteration 23. [2026-03-25 15:48:33,715][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:48:33,715][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:48:37,075][__main__][INFO] - Number of regex retries in iteration 23: 0 [2026-03-25 15:48:37,076][__main__][INFO] - agents played in iteration 23 are Alice, Bob [2026-03-25 15:48:37,686][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:48:38,347][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:48:38,639][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:48:38,962][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:48:39,282][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:48:39,603][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:48:39,923][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:48:40,244][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:48:40,565][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:48:40,887][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:48:41,208][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:48:41,528][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:48:41,849][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:48:42,170][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:48:42,490][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:48:42,811][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:48:43,131][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:48:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:48:43,772][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:48:44,094][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:48:44,415][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:48:44,737][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:48:45,059][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:48:45,380][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:48:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:48:46,023][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:48:46,345][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:48:46,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:48:46,987][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:48:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:48:47,632][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:48:47,952][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:48:48,274][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:48:48,596][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:48:48,917][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:48:49,238][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:48:49,559][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:48:49,880][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:48:50,202][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:48:50,523][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:48:50,844][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:48:51,166][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:48:51,489][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:48:51,811][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:48:52,132][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:48:52,454][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:48:52,777][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:48:53,098][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:48:53,420][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:48:53,743][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:48:54,064][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:48:54,386][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:48:54,706][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:48:55,327][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:48:55,648][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:48:55,969][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:48:56,289][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:48:56,610][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:48:56,931][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:48:57,251][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:48:57,572][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:48:57,893][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:48:58,214][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:48:58,535][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:48:58,857][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:48:59,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:48:59,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:49:00,548][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:49:00,551][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:49:00,552][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:49:01,196][__main__][INFO] - Iteration 24 took 27s (12.23% Gen, 85.42% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 24m 5s. Estimated total time: 7h 38m 2s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 48s, 500 more iterations: 3h 49m 1s. [2026-03-25 15:49:01,198][__main__][INFO] - Starting iteration 24. [2026-03-25 15:49:01,201][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:49:01,202][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:49:04,530][__main__][INFO] - Number of regex retries in iteration 24: 0 [2026-03-25 15:49:04,530][__main__][INFO] - agents played in iteration 24 are Alice, Bob [2026-03-25 15:49:05,127][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:49:05,791][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:49:06,084][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:49:06,406][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:49:06,726][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:49:07,047][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:49:07,369][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:49:07,689][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:49:08,009][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:49:08,329][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:49:08,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:49:08,971][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:49:09,292][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:49:09,613][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:49:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:49:10,253][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:49:10,573][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:49:10,895][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:49:11,216][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:49:11,537][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:49:11,857][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:49:12,179][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:49:12,500][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:49:12,821][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:49:13,141][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:49:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:49:13,784][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:49:14,106][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:49:14,427][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:49:14,748][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:49:15,069][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:49:15,389][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:49:15,709][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:49:16,031][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:49:16,351][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:49:16,672][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:49:16,992][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:49:17,314][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:49:17,635][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:49:17,956][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:49:18,278][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:49:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:49:18,920][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:49:19,242][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:49:19,562][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:49:19,883][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:49:20,204][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:49:20,524][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:49:20,844][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:49:21,164][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:49:21,485][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:49:21,806][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:49:22,127][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:49:22,746][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:49:23,068][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:49:23,389][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:49:23,710][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:49:24,032][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:49:24,353][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:49:24,673][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:49:24,994][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:49:25,315][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:49:25,636][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:49:25,958][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:49:26,278][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:49:26,600][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:49:27,259][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:49:27,981][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:49:27,983][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:49:27,985][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:49:28,629][__main__][INFO] - Iteration 25 took 27s (12.14% Gen, 85.51% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 22m 44s. Estimated total time: 7h 37m 9s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 42s, 500 more iterations: 3h 48m 34s. [2026-03-25 15:49:28,631][__main__][INFO] - Starting iteration 25. [2026-03-25 15:49:28,634][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:49:28,635][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:49:31,972][__main__][INFO] - Number of regex retries in iteration 25: 0 [2026-03-25 15:49:31,973][__main__][INFO] - agents played in iteration 25 are Alice, Bob [2026-03-25 15:49:32,568][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:49:33,231][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:49:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:49:33,844][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:49:34,163][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:49:34,484][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:49:34,805][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:49:35,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:49:35,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:49:35,768][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:49:36,089][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:49:36,410][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:49:36,731][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:49:37,052][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:49:37,372][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:49:37,693][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:49:38,014][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:49:38,335][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:49:38,656][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:49:38,977][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:49:39,297][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:49:39,618][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:49:39,938][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:49:40,259][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:49:40,579][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:49:40,899][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:49:41,221][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:49:41,542][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:49:41,864][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:49:42,184][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:49:42,504][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:49:42,825][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:49:43,145][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:49:43,466][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:49:43,787][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:49:44,108][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:49:44,429][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:49:44,750][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:49:45,072][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:49:45,393][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:49:45,713][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:49:46,035][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:49:46,355][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:49:46,676][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:49:46,997][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:49:47,318][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:49:47,639][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:49:47,960][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:49:48,281][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:49:48,602][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:49:48,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:49:49,243][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:49:49,564][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:49:50,181][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:49:50,502][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:49:50,823][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:49:51,143][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:49:51,463][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:49:51,785][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:49:52,106][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:49:52,427][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:49:52,748][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:49:53,070][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:49:53,391][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:49:53,711][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:49:54,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:49:54,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:49:55,402][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:49:55,405][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:49:55,406][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:49:56,045][__main__][INFO] - Iteration 26 took 27s (12.18% Gen, 85.49% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 21m 59s. Estimated total time: 7h 36m 51s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 25s. [2026-03-25 15:49:56,047][__main__][INFO] - Starting iteration 26. [2026-03-25 15:49:56,050][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:49:56,051][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:49:59,411][__main__][INFO] - Number of regex retries in iteration 26: 0 [2026-03-25 15:49:59,412][__main__][INFO] - agents played in iteration 26 are Alice, Bob [2026-03-25 15:50:00,005][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:50:00,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:50:00,957][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:50:01,280][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:50:01,601][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:50:01,922][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:50:02,242][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:50:02,564][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:50:02,886][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:50:03,208][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:50:03,528][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:50:03,851][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:50:04,172][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:50:04,494][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:50:04,816][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:50:05,136][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:50:05,458][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:50:05,782][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:50:06,103][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:50:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:50:06,745][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:50:07,066][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:50:07,388][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:50:07,709][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:50:08,030][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:50:08,350][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:50:08,671][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:50:08,992][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:50:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:50:09,634][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:50:09,955][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:50:10,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:50:10,597][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:50:10,917][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:50:11,239][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:50:11,562][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:50:11,883][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:50:12,205][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:50:12,525][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:50:12,847][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:50:13,170][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:50:13,492][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:50:13,815][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:50:14,136][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:50:14,456][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:50:14,777][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:50:15,098][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:50:15,419][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:50:15,739][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:50:16,061][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:50:16,384][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:50:16,704][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:50:17,026][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:50:17,645][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:50:17,967][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:50:18,288][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:50:18,610][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:50:18,931][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:50:19,252][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:50:19,573][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:50:19,893][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:50:20,213][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:50:20,535][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:50:20,857][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:50:21,179][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:50:21,501][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:50:22,155][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:50:22,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:50:22,868][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:50:22,870][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:50:23,518][__main__][INFO] - Iteration 27 took 27s (12.24% Gen, 85.40% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 22m 29s. Estimated total time: 7h 37m 48s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 46s, 500 more iterations: 3h 48m 54s. [2026-03-25 15:50:23,520][__main__][INFO] - Starting iteration 27. [2026-03-25 15:50:23,523][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:50:23,524][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:50:26,920][__main__][INFO] - Number of regex retries in iteration 27: 0 [2026-03-25 15:50:26,921][__main__][INFO] - agents played in iteration 27 are Alice, Bob [2026-03-25 15:50:27,509][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:50:28,169][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:50:28,461][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:50:28,783][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:50:29,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:50:29,424][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:50:29,745][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:50:30,066][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:50:30,387][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:50:30,708][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:50:31,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:50:31,350][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:50:31,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:50:31,994][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:50:32,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:50:32,636][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:50:32,956][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:50:33,278][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:50:33,599][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:50:33,921][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:50:34,242][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:50:34,564][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:50:34,885][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:50:35,206][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:50:35,528][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:50:35,848][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:50:36,169][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:50:36,490][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:50:36,810][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:50:37,132][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:50:37,453][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:50:37,774][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:50:38,096][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:50:38,418][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:50:38,739][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:50:39,060][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:50:39,381][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:50:39,702][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:50:40,024][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:50:40,345][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:50:40,666][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:50:40,987][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:50:41,308][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:50:41,629][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:50:41,949][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:50:42,272][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:50:42,593][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:50:42,913][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:50:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:50:43,556][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:50:43,878][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:50:44,199][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:50:44,521][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:50:45,138][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:50:45,460][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:50:45,781][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:50:46,102][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:50:46,423][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:50:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:50:47,064][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:50:47,385][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:50:47,706][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:50:48,027][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:50:48,347][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:50:48,669][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:50:48,991][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:50:49,645][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:50:50,356][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:50:50,358][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:50:50,360][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:50:50,997][__main__][INFO] - Iteration 28 took 27s (12.36% Gen, 85.31% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 22m 8s. Estimated total time: 7h 37m 55s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 47s, 500 more iterations: 3h 48m 57s. [2026-03-25 15:50:50,999][__main__][INFO] - Starting iteration 28. [2026-03-25 15:50:51,003][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:50:51,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:50:54,362][__main__][INFO] - Number of regex retries in iteration 28: 0 [2026-03-25 15:50:54,363][__main__][INFO] - agents played in iteration 28 are Alice, Bob [2026-03-25 15:50:54,958][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:50:55,620][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:50:55,912][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:50:56,234][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:50:56,555][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:50:56,876][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:50:57,199][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:50:57,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:50:57,843][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:50:58,164][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:50:58,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:50:58,807][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:50:59,129][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:50:59,450][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:50:59,771][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:51:00,093][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:51:00,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:51:00,736][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:51:01,056][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:51:01,377][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:51:01,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:51:02,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:51:02,340][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:51:02,660][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:51:02,982][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:51:03,303][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:51:03,625][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:51:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:51:04,267][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:51:04,588][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:51:04,908][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:51:05,229][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:51:05,551][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:51:05,872][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:51:06,194][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:51:06,515][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:51:06,836][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:51:07,156][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:51:07,478][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:51:07,799][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:51:08,120][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:51:08,441][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:51:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:51:09,083][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:51:09,404][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:51:09,725][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:51:10,047][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:51:10,368][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:51:10,689][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:51:11,010][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:51:11,331][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:51:11,652][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:51:11,972][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:51:12,589][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:51:12,910][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:51:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:51:13,552][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:51:13,874][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:51:14,195][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:51:14,515][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:51:14,836][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:51:15,158][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:51:15,479][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:51:15,801][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:51:16,121][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:51:16,443][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:51:17,098][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:51:17,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:51:17,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:51:17,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:51:18,463][__main__][INFO] - Iteration 29 took 27s (12.23% Gen, 85.40% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 21m 27s. Estimated total time: 7h 37m 41s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 46s, 500 more iterations: 3h 48m 50s. [2026-03-25 15:51:18,465][__main__][INFO] - Starting iteration 29. [2026-03-25 15:51:18,468][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:51:18,469][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:51:21,830][__main__][INFO] - Number of regex retries in iteration 29: 0 [2026-03-25 15:51:21,831][__main__][INFO] - agents played in iteration 29 are Alice, Bob [2026-03-25 15:51:22,427][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:51:23,089][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:51:23,381][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:51:23,704][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:51:24,026][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:51:24,347][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:51:24,669][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:51:24,990][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:51:25,311][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:51:25,632][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:51:25,952][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:51:26,273][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:51:26,594][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:51:26,915][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:51:27,236][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:51:27,556][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:51:27,878][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:51:28,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:51:28,520][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:51:28,842][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:51:29,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:51:29,484][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:51:29,805][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:51:30,127][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:51:30,448][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:51:30,770][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:51:31,092][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:51:31,414][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:51:31,735][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:51:32,057][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:51:32,379][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:51:32,701][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:51:33,023][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:51:33,344][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:51:33,665][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:51:33,986][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:51:34,307][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:51:34,628][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:51:34,950][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:51:35,271][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:51:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:51:35,915][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:51:36,236][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:51:36,557][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:51:36,878][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:51:37,200][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:51:37,520][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:51:37,841][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:51:38,163][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:51:38,485][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:51:38,806][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:51:39,128][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:51:39,449][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:51:40,067][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:51:40,389][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:51:40,710][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:51:41,030][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:51:41,351][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:51:41,673][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:51:41,994][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:51:42,315][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:51:42,637][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:51:42,958][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:51:43,280][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:51:43,601][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:51:43,922][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:51:44,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:51:45,285][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:51:45,288][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:51:45,289][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:51:45,931][__main__][INFO] - Iteration 30 took 27s (12.24% Gen, 85.42% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 21m 2s. Estimated total time: 7h 37m 44s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 46s, 500 more iterations: 3h 48m 52s. [2026-03-25 15:51:45,933][__main__][INFO] - Starting iteration 30. [2026-03-25 15:51:45,936][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:51:45,937][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:51:49,275][__main__][INFO] - Number of regex retries in iteration 30: 0 [2026-03-25 15:51:49,276][__main__][INFO] - agents played in iteration 30 are Alice, Bob [2026-03-25 15:51:49,895][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:51:50,558][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:51:50,850][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:51:51,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:51:51,494][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:51:51,815][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:51:52,136][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:51:52,457][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:51:52,778][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:51:53,099][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:51:53,420][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:51:53,741][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:51:54,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:51:54,384][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:51:54,705][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:51:55,025][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:51:55,346][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:51:55,668][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:51:55,989][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:51:56,309][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:51:56,630][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:51:56,951][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:51:57,273][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:51:57,593][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:51:57,915][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:51:58,236][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:51:58,557][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:51:58,878][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:51:59,199][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:51:59,520][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:51:59,841][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:52:00,163][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:52:00,485][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:52:00,806][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:52:01,127][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:52:01,448][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:52:01,770][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:52:02,090][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:52:02,411][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:52:02,733][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:52:03,054][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:52:03,375][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:52:03,695][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:52:04,016][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:52:04,337][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:52:04,657][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:52:04,979][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:52:05,301][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:52:05,622][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:52:05,945][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:52:06,266][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:52:06,587][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:52:06,909][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:52:07,526][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:52:07,848][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:52:08,169][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:52:08,491][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:52:08,812][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:52:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:52:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:52:09,776][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:52:10,096][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:52:10,417][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:52:10,737][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:52:11,057][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:52:11,378][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:52:12,034][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:52:12,750][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:52:12,752][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:52:12,754][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:52:13,394][__main__][INFO] - Iteration 31 took 27s (12.16% Gen, 85.50% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 20m 30s. Estimated total time: 7h 37m 39s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 45s, 500 more iterations: 3h 48m 49s. [2026-03-25 15:52:13,397][__main__][INFO] - Starting iteration 31. [2026-03-25 15:52:13,400][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:52:13,401][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:52:16,588][__main__][INFO] - Number of regex retries in iteration 31: 0 [2026-03-25 15:52:16,589][__main__][INFO] - agents played in iteration 31 are Alice, Bob [2026-03-25 15:52:17,171][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:52:17,834][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:52:18,126][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:52:18,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:52:18,769][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:52:19,089][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:52:19,409][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:52:19,730][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:52:20,051][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:52:20,372][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:52:20,693][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:52:21,015][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:52:21,335][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:52:21,657][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:52:21,978][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:52:22,299][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:52:22,620][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:52:22,942][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:52:23,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:52:23,584][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:52:23,905][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:52:24,226][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:52:24,546][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:52:24,867][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:52:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:52:25,509][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:52:25,830][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:52:26,151][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:52:26,472][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:52:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:52:27,114][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:52:27,436][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:52:27,757][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:52:28,078][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:52:28,399][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:52:28,720][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:52:29,041][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:52:29,362][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:52:29,683][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:52:30,003][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:52:30,325][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:52:30,646][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:52:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:52:31,289][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:52:31,609][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:52:31,930][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:52:32,251][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:52:32,572][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:52:32,892][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:52:33,213][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:52:33,534][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:52:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:52:34,176][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:52:34,793][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:52:35,115][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:52:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:52:35,757][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:52:36,079][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:52:36,401][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:52:36,722][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:52:37,043][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:52:37,364][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:52:37,684][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:52:38,005][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:52:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:52:38,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:52:39,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:52:40,023][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:52:40,025][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:52:40,027][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:52:40,665][__main__][INFO] - Iteration 32 took 27s (11.69% Gen, 85.96% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 16m 50s. Estimated total time: 7h 34m 26s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 13s. [2026-03-25 15:52:40,668][__main__][INFO] - Starting iteration 32. [2026-03-25 15:52:40,671][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:52:40,672][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:52:43,886][__main__][INFO] - Number of regex retries in iteration 32: 0 [2026-03-25 15:52:43,887][__main__][INFO] - agents played in iteration 32 are Alice, Bob [2026-03-25 15:52:44,478][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:52:45,142][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:52:45,434][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:52:45,755][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:52:46,076][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:52:46,397][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:52:46,719][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:52:47,039][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:52:47,360][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:52:47,680][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:52:48,001][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:52:48,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:52:48,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:52:48,963][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:52:49,284][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:52:49,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:52:49,927][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:52:50,249][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:52:50,571][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:52:50,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:52:51,215][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:52:51,538][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:52:51,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:52:52,182][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:52:52,503][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:52:52,824][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:52:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:52:53,466][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:52:53,787][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:52:54,109][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:52:54,430][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:52:54,751][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:52:55,072][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:52:55,393][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:52:55,714][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:52:56,034][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:52:56,354][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:52:56,675][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:52:56,995][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:52:57,317][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:52:57,638][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:52:57,960][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:52:58,281][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:52:58,603][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:52:58,924][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:52:59,246][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:52:59,567][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:52:59,888][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:53:00,208][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:53:00,529][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:53:00,849][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:53:01,170][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:53:01,491][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:53:02,111][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:53:02,433][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:53:02,754][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:53:03,077][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:53:03,399][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:53:03,720][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:53:04,042][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:53:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:53:04,686][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:53:05,007][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:53:05,329][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:53:05,651][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:53:05,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:53:06,632][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:53:07,347][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:53:07,349][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:53:07,350][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:53:07,989][__main__][INFO] - Iteration 33 took 27s (11.77% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 17m 15s. Estimated total time: 7h 35m 19s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 39s. [2026-03-25 15:53:07,991][__main__][INFO] - Starting iteration 33. [2026-03-25 15:53:07,994][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:53:07,995][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:53:11,207][__main__][INFO] - Number of regex retries in iteration 33: 0 [2026-03-25 15:53:11,208][__main__][INFO] - agents played in iteration 33 are Alice, Bob [2026-03-25 15:53:11,802][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:53:12,467][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:53:12,759][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:53:13,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:53:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:53:13,726][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:53:14,047][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:53:14,369][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:53:14,690][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:53:15,011][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:53:15,332][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:53:15,653][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:53:15,973][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:53:16,294][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:53:16,616][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:53:16,937][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:53:17,259][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:53:17,580][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:53:17,901][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:53:18,223][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:53:18,544][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:53:18,865][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:53:19,187][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:53:19,508][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:53:19,829][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:53:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:53:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:53:20,793][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:53:21,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:53:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:53:21,757][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:53:22,078][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:53:22,398][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:53:22,719][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:53:23,040][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:53:23,361][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:53:23,682][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:53:24,003][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:53:24,325][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:53:24,646][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:53:24,968][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:53:25,289][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:53:25,610][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:53:25,932][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:53:26,252][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:53:26,574][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:53:26,895][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:53:27,216][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:53:27,537][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:53:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:53:28,179][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:53:28,500][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:53:28,820][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:53:29,438][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:53:29,760][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:53:30,081][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:53:30,402][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:53:30,724][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:53:31,045][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:53:31,366][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:53:31,688][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:53:32,009][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:53:32,330][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:53:32,652][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:53:32,974][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:53:33,295][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:53:33,964][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:53:34,680][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:53:34,682][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:53:34,683][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:53:35,323][__main__][INFO] - Iteration 34 took 27s (11.76% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 16m 58s. Estimated total time: 7h 35m 29s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 32s, 500 more iterations: 3h 47m 44s. [2026-03-25 15:53:35,325][__main__][INFO] - Starting iteration 34. [2026-03-25 15:53:35,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:53:35,329][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:53:38,540][__main__][INFO] - Number of regex retries in iteration 34: 0 [2026-03-25 15:53:38,541][__main__][INFO] - agents played in iteration 34 are Alice, Bob [2026-03-25 15:53:39,144][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:53:39,809][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:53:40,101][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:53:40,423][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:53:40,745][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:53:41,066][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:53:41,387][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:53:41,709][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:53:42,030][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:53:42,351][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:53:42,672][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:53:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:53:43,315][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:53:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:53:43,958][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:53:44,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:53:44,600][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:53:44,920][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:53:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:53:45,562][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:53:45,884][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:53:46,205][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:53:46,527][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:53:46,849][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:53:47,169][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:53:47,491][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:53:47,812][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:53:48,133][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:53:48,455][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:53:48,777][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:53:49,098][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:53:49,420][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:53:49,741][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:53:50,063][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:53:50,385][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:53:50,707][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:53:51,028][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:53:51,349][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:53:51,670][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:53:51,990][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:53:52,311][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:53:52,633][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:53:52,954][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:53:53,275][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:53:53,596][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:53:53,917][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:53:54,239][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:53:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:53:54,883][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:53:55,204][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:53:55,525][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:53:55,846][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:53:56,168][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:53:56,787][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:53:57,109][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:53:57,431][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:53:57,751][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:53:58,073][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:53:58,393][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:53:58,714][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:53:59,035][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:53:59,356][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:53:59,677][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:53:59,999][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:54:00,320][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:54:00,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:54:01,300][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:54:02,020][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:54:02,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:54:02,024][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:54:02,666][__main__][INFO] - Iteration 35 took 27s (11.75% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 16m 40s. Estimated total time: 7h 35m 39s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 49s. [2026-03-25 15:54:02,669][__main__][INFO] - Starting iteration 35. [2026-03-25 15:54:02,672][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:54:02,672][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:54:05,873][__main__][INFO] - Number of regex retries in iteration 35: 0 [2026-03-25 15:54:05,873][__main__][INFO] - agents played in iteration 35 are Alice, Bob [2026-03-25 15:54:06,475][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:54:07,140][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:54:07,431][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:54:07,753][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:54:08,073][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:54:08,394][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:54:08,715][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:54:09,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:54:09,356][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:54:09,677][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:54:09,999][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:54:10,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:54:10,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:54:10,964][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:54:11,286][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:54:11,606][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:54:11,929][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:54:12,250][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:54:12,573][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:54:12,894][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:54:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:54:13,537][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:54:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:54:14,180][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:54:14,501][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:54:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:54:15,144][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:54:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:54:15,787][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:54:16,110][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:54:16,432][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:54:16,754][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:54:17,076][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:54:17,397][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:54:17,718][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:54:18,039][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:54:18,360][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:54:18,681][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:54:19,003][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:54:19,325][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:54:19,646][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:54:19,968][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:54:20,289][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:54:20,610][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:54:20,932][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:54:21,253][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:54:21,575][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:54:21,896][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:54:22,217][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:54:22,538][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:54:22,859][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:54:23,182][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:54:23,503][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:54:24,122][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:54:24,444][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:54:24,765][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:54:25,086][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:54:25,407][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:54:25,728][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:54:26,048][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:54:26,370][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:54:26,690][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:54:27,012][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:54:27,334][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:54:27,655][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:54:27,976][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:54:28,634][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:54:29,348][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:54:29,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:54:29,352][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:54:29,989][__main__][INFO] - Iteration 36 took 27s (11.72% Gen, 85.94% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 15m 52s. Estimated total time: 7h 35m 18s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 39s. [2026-03-25 15:54:29,991][__main__][INFO] - Starting iteration 36. [2026-03-25 15:54:29,994][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:54:29,995][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:54:33,184][__main__][INFO] - Number of regex retries in iteration 36: 0 [2026-03-25 15:54:33,185][__main__][INFO] - agents played in iteration 36 are Alice, Bob [2026-03-25 15:54:33,786][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:54:34,452][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:54:34,743][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:54:35,065][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:54:35,388][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:54:35,711][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:54:36,032][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:54:36,353][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:54:36,675][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:54:36,996][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:54:37,317][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:54:37,638][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:54:37,959][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:54:38,280][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:54:38,602][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:54:38,923][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:54:39,244][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:54:39,565][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:54:39,887][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:54:40,208][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:54:40,530][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:54:40,852][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:54:41,174][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:54:41,496][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:54:41,818][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:54:42,140][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:54:42,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:54:42,783][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:54:43,103][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:54:43,425][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:54:43,745][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:54:44,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:54:44,388][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:54:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:54:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:54:45,351][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:54:45,673][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:54:45,994][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:54:46,315][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:54:46,637][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:54:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:54:47,279][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:54:47,600][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:54:47,921][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:54:48,242][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:54:48,563][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:54:48,884][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:54:49,204][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:54:49,525][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:54:49,846][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:54:50,166][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:54:50,487][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:54:50,809][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:54:51,423][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:54:51,745][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:54:52,066][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:54:52,387][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:54:52,708][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:54:53,029][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:54:53,349][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:54:53,670][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:54:53,991][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:54:54,311][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:54:54,632][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:54:54,954][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:54:55,275][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:54:55,927][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:54:56,648][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:54:56,651][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:54:56,652][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:54:57,294][__main__][INFO] - Iteration 37 took 27s (11.69% Gen, 85.96% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 15m 8s. Estimated total time: 7h 35m 0s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 30s, 500 more iterations: 3h 47m 30s. [2026-03-25 15:54:57,296][__main__][INFO] - Starting iteration 37. [2026-03-25 15:54:57,300][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:54:57,300][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:55:00,498][__main__][INFO] - Number of regex retries in iteration 37: 0 [2026-03-25 15:55:00,498][__main__][INFO] - agents played in iteration 37 are Alice, Bob [2026-03-25 15:55:01,082][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:55:01,740][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:55:02,032][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:55:02,354][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:55:02,676][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:55:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:55:03,319][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:55:03,641][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:55:03,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:55:04,282][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:55:04,603][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:55:04,925][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:55:05,246][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:55:05,568][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:55:05,890][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:55:06,211][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:55:06,532][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:55:06,853][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:55:07,174][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:55:07,496][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:55:07,817][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:55:08,138][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:55:08,459][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:55:08,780][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:55:09,102][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:55:09,423][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:55:09,745][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:55:10,067][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:55:10,388][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:55:10,710][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:55:11,031][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:55:11,352][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:55:11,674][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:55:11,995][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:55:12,316][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:55:12,638][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:55:12,960][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:55:13,282][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:55:13,603][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:55:13,924][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:55:14,245][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:55:14,567][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:55:14,888][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:55:15,210][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:55:15,531][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:55:15,851][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:55:16,173][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:55:16,494][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:55:16,816][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:55:17,137][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:55:17,459][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:55:17,780][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:55:18,101][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:55:18,716][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:55:19,037][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:55:19,359][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:55:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:55:20,002][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:55:20,324][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:55:20,646][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:55:20,968][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:55:21,289][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:55:21,610][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:55:21,933][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:55:22,254][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:55:22,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:55:23,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:55:23,955][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:55:23,957][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:55:23,959][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:55:24,599][__main__][INFO] - Iteration 38 took 27s (11.72% Gen, 85.93% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 14m 40s. Estimated total time: 7h 35m 0s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 30s, 500 more iterations: 3h 47m 30s. [2026-03-25 15:55:24,601][__main__][INFO] - Starting iteration 38. [2026-03-25 15:55:24,604][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:55:24,605][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:55:27,820][__main__][INFO] - Number of regex retries in iteration 38: 0 [2026-03-25 15:55:27,821][__main__][INFO] - agents played in iteration 38 are Alice, Bob [2026-03-25 15:55:28,428][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:55:29,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:55:29,380][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:55:29,703][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:55:30,025][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:55:30,347][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:55:30,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:55:30,992][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:55:31,313][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:55:31,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:55:31,955][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:55:32,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:55:32,597][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:55:32,918][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:55:33,240][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:55:33,561][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:55:33,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:55:34,205][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:55:34,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:55:34,847][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:55:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:55:35,489][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:55:35,810][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:55:36,131][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:55:36,453][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:55:36,774][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:55:37,096][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:55:37,417][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:55:37,738][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:55:38,058][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:55:38,379][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:55:38,701][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:55:39,023][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:55:39,344][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:55:39,665][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:55:39,986][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:55:40,308][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:55:40,629][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:55:40,950][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:55:41,272][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:55:41,593][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:55:41,915][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:55:42,237][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:55:42,559][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:55:42,881][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:55:43,203][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:55:43,524][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:55:43,845][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:55:44,167][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:55:44,489][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:55:44,810][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:55:45,131][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:55:45,452][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:55:46,068][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:55:46,389][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:55:46,710][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:55:47,031][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:55:47,353][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:55:47,674][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:55:47,995][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:55:48,317][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:55:48,638][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:55:48,961][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:55:49,283][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:55:49,604][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:55:49,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:55:50,579][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:55:51,297][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:55:51,299][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:55:51,301][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:55:51,947][__main__][INFO] - Iteration 39 took 27s (11.76% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 14m 56s. Estimated total time: 7h 35m 44s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 34s, 500 more iterations: 3h 47m 52s. [2026-03-25 15:55:51,949][__main__][INFO] - Starting iteration 39. [2026-03-25 15:55:51,952][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:55:51,953][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:55:55,136][__main__][INFO] - Number of regex retries in iteration 39: 0 [2026-03-25 15:55:55,137][__main__][INFO] - agents played in iteration 39 are Alice, Bob [2026-03-25 15:55:55,744][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:55:56,403][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:55:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:55:57,016][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:55:57,338][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:55:57,659][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:55:57,980][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:55:58,301][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:55:58,622][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:55:58,943][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:55:59,265][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:55:59,586][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:55:59,906][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:56:00,228][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:56:00,549][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:56:00,871][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:56:01,193][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:56:01,514][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:56:01,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:56:02,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:56:02,476][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:56:02,798][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:56:03,119][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:56:03,441][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:56:03,763][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:56:04,084][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:56:04,406][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:56:04,727][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:56:05,048][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:56:05,370][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:56:05,692][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:56:06,015][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:56:06,337][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:56:06,659][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:56:06,980][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:56:07,302][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:56:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:56:07,945][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:56:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:56:08,586][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:56:08,908][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:56:09,231][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:56:09,552][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:56:09,873][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:56:10,196][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:56:10,518][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:56:10,839][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:56:11,161][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:56:11,483][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:56:11,805][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:56:12,125][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:56:12,446][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:56:12,769][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:56:13,385][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:56:13,706][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:56:14,028][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:56:14,349][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:56:14,669][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:56:14,991][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:56:15,313][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:56:15,634][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:56:15,956][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:56:16,278][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:56:16,599][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:56:16,920][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:56:17,242][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:56:17,894][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:56:18,612][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:56:18,614][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:56:18,616][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:56:19,262][__main__][INFO] - Iteration 40 took 27s (11.66% Gen, 85.97% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 13m 56s. Estimated total time: 7h 35m 11s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 35s. [2026-03-25 15:56:19,265][__main__][INFO] - Starting iteration 40. [2026-03-25 15:56:19,268][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:56:19,268][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:56:22,480][__main__][INFO] - Number of regex retries in iteration 40: 0 [2026-03-25 15:56:22,481][__main__][INFO] - agents played in iteration 40 are Alice, Bob [2026-03-25 15:56:23,090][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:56:23,748][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:56:24,039][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:56:24,362][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:56:24,683][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:56:25,004][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:56:25,325][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:56:25,646][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:56:25,966][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:56:26,287][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:56:26,608][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:56:26,929][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:56:27,250][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:56:27,571][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:56:27,893][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:56:28,214][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:56:28,536][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:56:28,857][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:56:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:56:29,500][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:56:29,822][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:56:30,143][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:56:30,465][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:56:30,786][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:56:31,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:56:31,429][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:56:31,751][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:56:32,072][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:56:32,394][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:56:32,715][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:56:33,037][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:56:33,358][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:56:33,679][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:56:34,000][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:56:34,322][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:56:34,643][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:56:34,964][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:56:35,286][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:56:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:56:35,929][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:56:36,250][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:56:36,571][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:56:36,892][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:56:37,213][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:56:37,534][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:56:37,854][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:56:38,175][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:56:38,496][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:56:38,819][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:56:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:56:39,461][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:56:39,783][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:56:40,104][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:56:40,720][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:56:41,041][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:56:41,362][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:56:41,683][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:56:42,004][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:56:42,325][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:56:42,647][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:56:42,968][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:56:43,290][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:56:43,610][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:56:43,932][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:56:44,254][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:56:44,575][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:56:45,228][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:56:45,951][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:56:45,953][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:56:45,955][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:56:46,598][__main__][INFO] - Iteration 41 took 27s (11.75% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 13m 49s. Estimated total time: 7h 35m 31s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 45s. [2026-03-25 15:56:46,600][__main__][INFO] - Starting iteration 41. [2026-03-25 15:56:46,603][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:56:46,604][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:56:49,807][__main__][INFO] - Number of regex retries in iteration 41: 0 [2026-03-25 15:56:49,807][__main__][INFO] - agents played in iteration 41 are Alice, Bob [2026-03-25 15:56:50,386][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:56:51,050][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:56:51,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:56:51,664][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:56:51,985][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:56:52,306][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:56:52,628][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:56:52,949][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:56:53,271][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:56:53,592][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:56:53,914][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:56:54,235][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:56:54,557][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:56:54,879][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:56:55,200][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:56:55,523][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:56:55,846][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:56:56,167][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:56:56,489][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:56:56,811][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:56:57,133][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:56:57,456][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:56:57,778][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:56:58,100][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:56:58,421][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:56:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:56:59,065][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:56:59,387][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:56:59,708][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:57:00,030][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:57:00,352][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:57:00,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:57:00,995][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:57:01,317][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:57:01,638][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:57:01,960][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:57:02,282][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:57:02,603][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:57:02,925][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:57:03,246][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:57:03,568][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:57:03,889][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:57:04,210][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:57:04,531][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:57:04,853][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:57:05,174][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:57:05,495][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:57:05,817][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:57:06,139][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:57:06,459][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:57:06,779][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:57:07,100][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:57:07,421][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:57:08,036][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:57:08,358][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:57:08,680][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:57:09,002][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:57:09,324][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:57:09,646][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:57:09,967][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:57:10,288][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:57:10,609][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:57:10,930][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:57:11,251][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:57:11,572][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:57:11,893][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:57:12,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:57:13,266][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:57:13,268][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:57:13,270][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:57:13,910][__main__][INFO] - Iteration 42 took 27s (11.73% Gen, 85.92% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 12m 58s. Estimated total time: 7h 35m 7s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 30s, 500 more iterations: 3h 47m 33s. [2026-03-25 15:57:13,912][__main__][INFO] - Starting iteration 42. [2026-03-25 15:57:13,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:57:13,916][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:57:17,138][__main__][INFO] - Number of regex retries in iteration 42: 0 [2026-03-25 15:57:17,139][__main__][INFO] - agents played in iteration 42 are Alice, Bob [2026-03-25 15:57:17,727][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:57:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:57:18,677][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:57:18,998][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:57:19,319][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:57:19,641][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:57:19,962][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:57:20,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:57:20,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:57:20,927][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:57:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:57:21,568][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:57:21,889][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:57:22,211][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:57:22,532][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:57:22,854][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:57:23,176][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:57:23,497][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:57:23,818][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:57:24,138][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:57:24,460][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:57:24,781][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:57:25,103][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:57:25,424][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:57:25,745][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:57:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:57:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:57:26,708][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:57:27,030][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:57:27,352][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:57:27,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:57:27,993][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:57:28,315][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:57:28,636][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:57:28,957][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:57:29,279][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:57:29,600][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:57:29,922][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:57:30,244][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:57:30,566][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:57:30,886][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:57:31,208][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:57:31,530][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:57:31,851][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:57:32,172][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:57:32,494][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:57:32,815][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:57:33,137][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:57:33,458][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:57:33,781][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:57:34,102][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:57:34,423][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:57:34,745][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:57:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:57:35,683][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:57:36,005][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:57:36,326][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:57:36,648][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:57:36,969][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:57:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:57:37,612][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:57:37,934][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:57:38,255][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:57:38,576][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:57:38,897][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:57:39,218][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:57:39,871][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:57:40,691][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:57:40,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:57:40,695][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:57:41,335][__main__][INFO] - Iteration 43 took 27s (11.75% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 14m 24s. Estimated total time: 7h 37m 1s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 42s, 500 more iterations: 3h 48m 30s. [2026-03-25 15:57:41,338][__main__][INFO] - Starting iteration 43. [2026-03-25 15:57:41,341][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:57:41,342][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:57:44,620][__main__][INFO] - Number of regex retries in iteration 43: 0 [2026-03-25 15:57:44,621][__main__][INFO] - agents played in iteration 43 are Alice, Bob [2026-03-25 15:57:45,218][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:57:45,877][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:57:46,169][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:57:46,492][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:57:46,814][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:57:47,135][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:57:47,456][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:57:47,777][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:57:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:57:48,420][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:57:48,742][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:57:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:57:49,386][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:57:49,708][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:57:50,029][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:57:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:57:50,670][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:57:50,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:57:51,312][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:57:51,634][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:57:51,954][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:57:52,275][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:57:52,595][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:57:52,917][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:57:53,238][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:57:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:57:53,880][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:57:54,201][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:57:54,523][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:57:54,844][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:57:55,166][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:57:55,487][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:57:55,810][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:57:56,132][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:57:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:57:56,774][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:57:57,095][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:57:57,416][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:57:57,737][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:57:58,059][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:57:58,380][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:57:58,703][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:57:59,023][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:57:59,344][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:57:59,664][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:57:59,985][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:58:00,306][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:58:00,627][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:58:00,949][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:58:01,271][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:58:01,594][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:58:01,916][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:58:02,238][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:58:02,855][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:58:03,177][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:58:03,498][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:58:03,819][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:58:04,141][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:58:04,465][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:58:04,786][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:58:05,106][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:58:05,427][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:58:05,748][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:58:06,069][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:58:06,391][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:58:06,711][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:58:07,363][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:58:08,091][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:58:08,093][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:58:08,095][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:58:08,745][__main__][INFO] - Iteration 44 took 27s (11.96% Gen, 85.65% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 13m 40s. Estimated total time: 7h 36m 45s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 22s. [2026-03-25 15:58:08,747][__main__][INFO] - Starting iteration 44. [2026-03-25 15:58:08,751][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:58:08,751][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:58:12,002][__main__][INFO] - Number of regex retries in iteration 44: 0 [2026-03-25 15:58:12,002][__main__][INFO] - agents played in iteration 44 are Alice, Bob [2026-03-25 15:58:12,598][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:58:13,258][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:58:13,550][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:58:13,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:58:14,194][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:58:14,514][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:58:14,836][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:58:15,157][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:58:15,477][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:58:15,798][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:58:16,120][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:58:16,442][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:58:16,764][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:58:17,085][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:58:17,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:58:17,729][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:58:18,090][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:58:18,411][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:58:18,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:58:19,054][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:58:19,375][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:58:19,697][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:58:20,019][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:58:20,340][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:58:20,662][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:58:20,983][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:58:21,305][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:58:21,626][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:58:21,947][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:58:22,268][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:58:22,588][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:58:22,909][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:58:23,231][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:58:23,552][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:58:23,873][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:58:24,194][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:58:24,515][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:58:24,837][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:58:25,159][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:58:25,481][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:58:25,803][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:58:26,123][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:58:26,444][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:58:26,766][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:58:27,088][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:58:27,410][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:58:27,732][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:58:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:58:28,376][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:58:28,697][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:58:29,019][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:58:29,341][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:58:29,663][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:58:30,277][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:58:30,598][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:58:30,920][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:58:31,241][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:58:31,563][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:58:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:58:32,206][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:58:32,528][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:58:32,849][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:58:33,817][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:58:34,140][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:58:34,461][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:58:34,783][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:58:35,441][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:22 [2026-03-25 15:58:36,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:58:36,275][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:58:36,276][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:58:36,924][__main__][INFO] - Iteration 45 took 28s (11.54% Gen, 86.16% Train). Generation: 3s, Training: 24s. Estimated remaining time: 7h 26m 1s. Estimated total time: 7h 49m 34s. Time estimates for 10 more iterations: 4m 41s, 100 more iterations: 46m 57s, 500 more iterations: 3h 54m 47s. [2026-03-25 15:58:36,927][__main__][INFO] - Starting iteration 45. [2026-03-25 15:58:36,930][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:58:36,931][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:58:40,141][__main__][INFO] - Number of regex retries in iteration 45: 0 [2026-03-25 15:58:40,142][__main__][INFO] - agents played in iteration 45 are Alice, Bob [2026-03-25 15:58:40,747][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:58:41,413][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:58:41,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:58:42,026][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:58:42,347][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:58:42,668][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:58:42,989][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:58:43,311][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:58:43,632][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:58:43,954][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:58:44,275][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:58:44,595][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:58:44,916][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:58:45,237][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:58:45,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:58:45,879][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:58:46,200][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:58:46,522][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:58:46,843][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:58:47,164][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:58:47,486][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:58:47,806][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:58:48,127][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:58:48,448][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:58:48,769][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:58:49,090][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:58:49,411][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:58:49,732][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:58:50,054][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:58:50,375][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:58:50,696][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:58:51,017][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:58:51,338][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:58:51,660][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:58:51,981][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:58:52,303][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:58:52,624][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:58:52,946][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:58:53,267][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:58:53,587][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:58:53,908][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:58:54,229][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:58:54,551][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:58:54,872][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:58:55,192][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:58:55,512][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:58:55,833][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:58:56,155][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:58:56,476][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:58:56,798][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:58:57,119][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:58:57,441][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:58:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:58:58,381][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:58:58,704][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:58:59,025][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:58:59,347][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:58:59,668][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:58:59,988][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:59:00,310][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:59:00,632][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:59:00,954][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:59:01,276][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:59:01,597][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:59:01,918][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:59:02,240][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:59:02,898][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:59:03,650][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:59:03,652][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:59:03,654][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:59:04,304][__main__][INFO] - Iteration 46 took 27s (11.73% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 12m 15s. Estimated total time: 7h 36m 15s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 7s. [2026-03-25 15:59:04,307][__main__][INFO] - Starting iteration 46. [2026-03-25 15:59:04,310][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:59:04,310][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:59:07,514][__main__][INFO] - Number of regex retries in iteration 46: 0 [2026-03-25 15:59:07,515][__main__][INFO] - agents played in iteration 46 are Alice, Bob [2026-03-25 15:59:08,100][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:59:08,765][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:59:09,057][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:59:09,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:59:09,700][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:59:10,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:59:10,343][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:59:10,664][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:59:10,986][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:59:11,307][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:59:11,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:59:11,949][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:59:12,269][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:59:12,590][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:59:12,910][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:59:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:59:13,553][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:59:13,874][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:59:14,195][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:59:14,517][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:59:14,838][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:59:15,159][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:59:15,480][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:59:15,801][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:59:16,122][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:59:16,444][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:59:16,765][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:59:17,086][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:59:17,408][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:59:17,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:59:18,050][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:59:18,371][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:59:18,692][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:59:19,014][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:59:19,335][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:59:19,657][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:59:19,979][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:59:20,300][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:59:20,621][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:59:20,942][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:59:21,264][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:59:21,585][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:59:21,907][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:59:22,228][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:59:22,550][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:59:22,872][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:59:23,194][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:59:23,516][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:59:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:59:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:59:24,479][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:59:24,800][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:59:25,122][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:59:25,742][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:59:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:59:26,387][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:59:26,708][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:59:27,030][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:59:27,352][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:59:27,673][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:59:27,995][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:59:28,317][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:59:28,638][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:59:28,959][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:59:29,281][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:59:29,603][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:59:30,262][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:59:31,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:59:31,021][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:59:31,023][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:59:31,663][__main__][INFO] - Iteration 47 took 27s (11.71% Gen, 85.94% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 11m 27s. Estimated total time: 7h 35m 54s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 35s, 500 more iterations: 3h 47m 57s. [2026-03-25 15:59:31,666][__main__][INFO] - Starting iteration 47. [2026-03-25 15:59:31,669][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:59:31,670][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 15:59:34,890][__main__][INFO] - Number of regex retries in iteration 47: 0 [2026-03-25 15:59:34,891][__main__][INFO] - agents played in iteration 47 are Alice, Bob [2026-03-25 15:59:35,483][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 15:59:36,142][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 15:59:36,434][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 15:59:36,757][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 15:59:37,078][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 15:59:37,399][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 15:59:37,720][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 15:59:38,042][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 15:59:38,363][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 15:59:38,684][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 15:59:39,006][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 15:59:39,328][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 15:59:39,650][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 15:59:39,972][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 15:59:40,294][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 15:59:40,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 15:59:40,938][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 15:59:41,260][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 15:59:41,581][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 15:59:41,902][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 15:59:42,224][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 15:59:42,546][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 15:59:42,868][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 15:59:43,191][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 15:59:43,512][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 15:59:43,833][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 15:59:44,155][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 15:59:44,476][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 15:59:44,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 15:59:45,120][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 15:59:45,442][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 15:59:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 15:59:46,085][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 15:59:46,406][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 15:59:46,726][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 15:59:47,047][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 15:59:47,370][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 15:59:47,691][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 15:59:48,011][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 15:59:48,332][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 15:59:48,653][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 15:59:48,974][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 15:59:49,295][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 15:59:49,616][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 15:59:49,937][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 15:59:50,259][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 15:59:50,580][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 15:59:50,903][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 15:59:51,224][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 15:59:51,545][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 15:59:51,866][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 15:59:52,186][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 15:59:52,507][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 15:59:53,121][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 15:59:53,443][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 15:59:53,765][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 15:59:54,086][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 15:59:54,407][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 15:59:54,728][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 15:59:55,050][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 15:59:55,372][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 15:59:55,694][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 15:59:56,016][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 15:59:56,337][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 15:59:56,659][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 15:59:56,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 15:59:57,634][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 15:59:58,377][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 15:59:58,379][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 15:59:58,381][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 15:59:59,033][__main__][INFO] - Iteration 48 took 27s (11.77% Gen, 85.84% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 11m 10s. Estimated total time: 7h 36m 5s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 36s, 500 more iterations: 3h 48m 2s. [2026-03-25 15:59:59,035][__main__][INFO] - Starting iteration 48. [2026-03-25 15:59:59,038][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 15:59:59,039][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:00:02,255][__main__][INFO] - Number of regex retries in iteration 48: 0 [2026-03-25 16:00:02,255][__main__][INFO] - agents played in iteration 48 are Alice, Bob [2026-03-25 16:00:02,822][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:00:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:00:03,773][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:00:04,094][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:00:04,415][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:00:04,735][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:00:05,056][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:00:05,377][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:00:05,699][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:00:06,020][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:00:06,343][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:00:06,666][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:00:06,988][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:00:07,309][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:00:07,629][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:00:07,950][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:00:08,272][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:00:08,593][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:00:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:00:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:00:09,558][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:00:09,879][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:00:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:00:10,520][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:00:10,842][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:00:11,165][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:00:11,490][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:00:11,811][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:00:12,133][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:00:12,454][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:00:12,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:00:13,097][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:00:13,419][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:00:13,740][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:00:14,062][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:00:14,383][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:00:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:00:15,026][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:00:15,348][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:00:15,670][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:00:15,991][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:00:16,312][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:00:16,634][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:00:16,955][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:00:17,277][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:00:17,600][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:00:17,921][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:00:18,243][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:00:18,565][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:00:18,887][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:00:19,208][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:00:19,530][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:00:19,850][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:00:20,469][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:00:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:00:21,111][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:00:21,431][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:00:21,752][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:00:22,073][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:00:22,396][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:00:22,717][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:00:23,037][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:00:23,358][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:00:23,680][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:00:24,001][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:00:24,322][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:00:24,981][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:00:25,715][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:00:25,717][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:00:25,719][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:00:26,421][__main__][INFO] - Iteration 49 took 27s (11.75% Gen, 85.68% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 11m 1s. Estimated total time: 7h 36m 23s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 11s. [2026-03-25 16:00:26,423][__main__][INFO] - Starting iteration 49. [2026-03-25 16:00:26,426][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 16:00:26,427][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:00:29,678][__main__][INFO] - Number of regex retries in iteration 49: 0 [2026-03-25 16:00:29,679][__main__][INFO] - agents played in iteration 49 are Alice, Bob [2026-03-25 16:00:30,255][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:00:30,914][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:00:31,205][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:00:31,526][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:00:31,848][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:00:32,170][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:00:32,491][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:00:32,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:00:33,134][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:00:33,455][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:00:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:00:34,097][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:00:34,418][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:00:34,739][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:00:35,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:00:35,382][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:00:35,706][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:00:36,029][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:00:36,351][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:00:36,672][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:00:36,993][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:00:37,315][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:00:37,636][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:00:37,957][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:00:38,279][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:00:38,600][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:00:38,921][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:00:39,242][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:00:39,563][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:00:39,883][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:00:40,205][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:00:40,527][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:00:40,849][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:00:41,170][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:00:41,492][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:00:41,814][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:00:42,135][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:00:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:00:42,777][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:00:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:00:43,420][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:00:43,742][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:00:44,066][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:00:44,387][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:00:44,709][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:00:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:00:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:00:45,674][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:00:45,996][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:00:46,318][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:00:46,640][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:00:46,962][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:00:47,284][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:00:47,900][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:00:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:00:48,542][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:00:48,863][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:00:49,185][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:00:49,507][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:00:49,828][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:00:50,150][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:00:50,471][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:00:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:00:51,116][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:00:51,438][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:00:51,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:00:52,421][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:00:53,164][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:00:53,166][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:00:53,168][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:00:53,810][__main__][INFO] - Iteration 50 took 27s (11.88% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 10m 36s. Estimated total time: 7h 36m 25s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 12s. [2026-03-25 16:00:53,813][__main__][INFO] - Starting iteration 50. [2026-03-25 16:00:53,816][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 0 and human policies 1. [2026-03-25 16:00:53,816][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:00:57,027][__main__][INFO] - Number of regex retries in iteration 50: 0 [2026-03-25 16:00:57,028][__main__][INFO] - agents played in iteration 50 are Alice, Bob [2026-03-25 16:00:57,601][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:00:58,263][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:00:58,554][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:00:58,877][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:00:59,198][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:00:59,520][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:00:59,842][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:01:00,164][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:01:00,486][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:01:00,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:01:01,129][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:01:01,451][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:01:01,773][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:01:02,096][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:01:02,418][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:01:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:01:03,063][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:01:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:01:03,708][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:01:04,029][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:01:04,352][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:01:04,672][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:01:04,994][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:01:05,316][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:01:05,637][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:01:05,958][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:01:06,279][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:01:06,602][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:01:06,923][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:01:07,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:01:07,566][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:01:07,888][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:01:08,209][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:01:08,532][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:01:08,853][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:01:09,174][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:01:09,495][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:01:09,817][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:01:10,139][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:01:10,461][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:01:10,782][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:01:11,104][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:01:11,425][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:01:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:01:12,068][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:01:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:01:12,712][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:01:13,033][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:01:13,355][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:01:13,675][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:01:13,997][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:01:14,318][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:01:14,639][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:01:15,257][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:01:15,579][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:01:15,900][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:01:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:01:16,545][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:01:16,866][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:01:17,188][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:01:17,510][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:01:17,832][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:01:18,153][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:01:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:01:18,799][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:01:19,120][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:01:19,776][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:01:20,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:01:20,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:01:20,556][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:01:21,818][__main__][INFO] - Iteration 51 took 28s (11.47% Gen, 84.02% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 20m 26s. Estimated total time: 7h 46m 43s. Time estimates for 10 more iterations: 4m 40s, 100 more iterations: 46m 40s, 500 more iterations: 3h 53m 21s. [2026-03-25 16:01:21,820][__main__][INFO] - Starting iteration 51. [2026-03-25 16:01:21,823][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:01:21,824][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:01:25,046][__main__][INFO] - Number of regex retries in iteration 51: 0 [2026-03-25 16:01:25,047][__main__][INFO] - agents played in iteration 51 are Alice, Bob [2026-03-25 16:01:25,627][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:01:26,286][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:01:26,579][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:01:26,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:01:27,222][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:01:27,544][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:01:27,865][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:01:28,187][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:01:28,509][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:01:28,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:01:29,152][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:01:29,473][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:01:29,794][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:01:30,116][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:01:30,437][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:01:30,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:01:31,079][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:01:31,400][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:01:31,722][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:01:32,044][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:01:32,367][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:01:32,689][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:01:33,011][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:01:33,332][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:01:33,654][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:01:33,976][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:01:34,298][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:01:34,618][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:01:34,941][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:01:35,263][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:01:35,586][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:01:35,908][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:01:36,230][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:01:36,552][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:01:36,873][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:01:37,195][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:01:37,516][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:01:37,838][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:01:38,160][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:01:38,481][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:01:38,802][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:01:39,124][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:01:39,446][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:01:39,768][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:01:40,090][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:01:40,411][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:01:40,732][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:01:41,054][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:01:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:01:41,697][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:01:42,019][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:01:42,341][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:01:42,663][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:01:43,278][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:01:43,599][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:01:43,921][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:01:44,243][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:01:44,564][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:01:44,886][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:01:45,208][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:01:45,529][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:01:45,852][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:01:46,173][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:01:46,495][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:01:46,816][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:01:47,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:01:48,608][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:22 [2026-03-25 16:01:49,356][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:01:49,358][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:01:49,360][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:01:50,011][__main__][INFO] - Iteration 52 took 28s (11.43% Gen, 86.25% Train). Generation: 3s, Training: 24s. Estimated remaining time: 7h 23m 3s. Estimated total time: 7h 49m 49s. Time estimates for 10 more iterations: 4m 41s, 100 more iterations: 46m 58s, 500 more iterations: 3h 54m 54s. [2026-03-25 16:01:50,013][__main__][INFO] - Starting iteration 52. [2026-03-25 16:01:50,017][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:01:50,017][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:01:53,232][__main__][INFO] - Number of regex retries in iteration 52: 0 [2026-03-25 16:01:53,232][__main__][INFO] - agents played in iteration 52 are Alice, Bob [2026-03-25 16:01:53,808][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:01:54,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:01:54,757][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:01:55,078][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:01:55,400][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:01:55,723][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:01:56,045][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:01:56,366][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:01:56,687][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:01:57,009][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:01:57,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:01:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:01:57,972][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:01:58,293][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:01:58,615][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:01:58,935][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:01:59,256][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:01:59,577][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:01:59,898][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:02:00,220][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:02:00,541][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:02:00,863][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:02:01,184][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:02:01,506][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:02:01,828][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:02:02,149][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:02:02,471][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:02:02,793][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:02:03,115][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:02:03,436][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:02:03,758][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:02:04,078][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:02:04,399][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:02:04,721][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:02:05,043][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:02:05,366][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:02:05,688][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:02:06,009][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:02:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:02:06,653][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:02:06,974][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:02:07,296][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:02:07,617][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:02:07,940][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:02:08,261][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:02:08,585][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:02:08,906][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:02:09,228][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:02:09,550][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:02:09,872][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:02:10,193][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:02:10,514][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:02:10,836][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:02:11,453][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:02:11,775][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:02:12,098][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:02:12,421][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:02:12,743][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:02:13,065][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:02:13,386][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:02:13,707][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:02:14,028][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:02:14,350][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:02:14,672][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:02:14,993][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:02:15,315][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:02:15,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:02:16,718][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:02:16,720][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:02:16,722][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:02:17,342][__main__][INFO] - Iteration 53 took 27s (11.77% Gen, 85.96% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 8m 13s. Estimated total time: 7h 35m 26s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 32s, 500 more iterations: 3h 47m 43s. [2026-03-25 16:02:17,344][__main__][INFO] - Starting iteration 53. [2026-03-25 16:02:17,347][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:02:17,348][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:02:20,808][__main__][INFO] - Number of regex retries in iteration 53: 0 [2026-03-25 16:02:20,809][__main__][INFO] - agents played in iteration 53 are Alice, Bob [2026-03-25 16:02:21,385][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:02:22,047][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:02:22,339][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:02:22,662][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:02:22,983][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:02:23,305][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:02:23,626][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:02:23,948][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:02:24,269][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:02:24,591][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:02:24,912][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:02:25,234][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:02:25,554][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:02:25,875][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:02:26,196][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:02:26,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:02:26,838][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:02:27,161][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:02:27,483][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:02:27,805][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:02:28,127][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:02:28,449][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:02:28,769][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:02:29,090][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:02:29,411][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:02:29,732][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:02:30,053][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:02:30,376][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:02:30,697][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:02:31,020][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:02:31,341][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:02:31,663][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:02:31,985][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:02:32,305][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:02:32,627][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:02:32,949][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:02:33,271][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:02:33,592][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:02:33,915][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:02:34,236][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:02:34,558][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:02:34,879][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:02:35,202][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:02:35,524][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:02:35,847][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:02:36,169][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:02:36,491][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:02:36,812][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:02:37,134][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:02:37,456][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:02:37,778][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:02:38,099][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:02:38,421][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:02:39,037][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:02:39,359][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:02:39,680][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:02:40,001][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:02:40,323][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:02:40,644][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:02:40,966][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:02:41,287][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:02:41,608][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:02:41,930][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:02:42,252][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:02:42,574][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:02:42,895][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:02:43,550][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:02:44,288][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:02:44,290][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:02:44,292][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:02:44,936][__main__][INFO] - Iteration 54 took 27s (12.55% Gen, 85.11% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 12m 9s. Estimated total time: 7h 39m 50s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 59s, 500 more iterations: 3h 49m 55s. [2026-03-25 16:02:44,938][__main__][INFO] - Starting iteration 54. [2026-03-25 16:02:44,941][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:02:44,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:02:48,176][__main__][INFO] - Number of regex retries in iteration 54: 0 [2026-03-25 16:02:48,177][__main__][INFO] - agents played in iteration 54 are Alice, Bob [2026-03-25 16:02:48,735][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:02:49,395][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:02:49,686][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:02:50,007][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:02:50,328][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:02:50,649][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:02:50,970][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:02:51,290][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:02:51,611][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:02:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:02:52,252][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:02:52,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:02:52,896][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:02:53,218][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:02:53,540][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:02:53,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:02:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:02:54,504][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:02:54,826][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:02:55,148][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:02:55,470][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:02:55,792][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:02:56,114][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:02:56,435][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:02:56,757][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:02:57,079][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:02:57,402][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:02:57,724][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:02:58,045][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:02:58,366][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:02:58,689][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:02:59,010][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:02:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:02:59,653][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:02:59,975][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:03:00,296][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:03:00,618][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:03:00,938][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:03:01,259][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:03:01,581][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:03:01,903][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:03:02,225][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:03:02,546][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:03:02,868][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:03:03,189][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:03:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:03:03,832][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:03:04,154][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:03:04,475][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:03:04,797][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:03:05,119][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:03:05,441][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:03:05,762][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:03:06,378][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:03:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:03:07,021][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:03:07,343][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:03:07,665][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:03:07,985][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:03:08,306][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:03:08,627][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:03:08,949][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:03:09,270][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:03:09,592][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:03:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:03:10,236][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:03:10,890][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:03:11,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:03:11,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:03:11,637][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:03:12,276][__main__][INFO] - Iteration 55 took 27s (11.84% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 7m 28s. Estimated total time: 7h 35m 36s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 48s. [2026-03-25 16:03:12,279][__main__][INFO] - Starting iteration 55. [2026-03-25 16:03:12,282][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:03:12,282][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:03:15,498][__main__][INFO] - Number of regex retries in iteration 55: 0 [2026-03-25 16:03:15,498][__main__][INFO] - agents played in iteration 55 are Alice, Bob [2026-03-25 16:03:16,074][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:03:16,734][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:03:17,026][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:03:17,349][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:03:17,669][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:03:17,990][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:03:18,311][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:03:18,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:03:18,953][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:03:19,274][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:03:19,595][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:03:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:03:20,239][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:03:20,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:03:20,881][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:03:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:03:21,523][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:03:21,845][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:03:22,166][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:03:22,487][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:03:22,809][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:03:23,130][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:03:23,452][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:03:23,772][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:03:24,094][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:03:24,415][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:03:24,737][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:03:25,058][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:03:25,380][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:03:25,701][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:03:26,023][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:03:26,345][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:03:26,667][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:03:26,988][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:03:27,310][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:03:27,631][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:03:27,953][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:03:28,275][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:03:28,595][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:03:28,917][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:03:29,237][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:03:29,558][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:03:29,879][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:03:30,201][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:03:30,524][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:03:30,845][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:03:31,168][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:03:31,490][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:03:31,810][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:03:32,131][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:03:32,453][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:03:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:03:33,095][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:03:33,712][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:03:34,034][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:03:34,355][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:03:34,677][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:03:34,998][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:03:35,320][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:03:35,642][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:03:35,963][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:03:36,284][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:03:36,606][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:03:36,928][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:03:37,250][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:03:37,571][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:03:38,227][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:03:38,974][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:03:38,976][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:03:38,977][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:03:39,624][__main__][INFO] - Iteration 56 took 27s (11.76% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 7m 8s. Estimated total time: 7h 35m 43s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 34s, 500 more iterations: 3h 47m 51s. [2026-03-25 16:03:39,627][__main__][INFO] - Starting iteration 56. [2026-03-25 16:03:39,630][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:03:39,630][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:03:42,868][__main__][INFO] - Number of regex retries in iteration 56: 0 [2026-03-25 16:03:42,869][__main__][INFO] - agents played in iteration 56 are Alice, Bob [2026-03-25 16:03:43,433][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:03:44,093][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:03:44,385][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:03:44,707][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:03:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:03:45,349][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:03:45,670][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:03:45,991][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:03:46,312][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:03:46,634][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:03:46,955][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:03:47,276][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:03:47,597][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:03:47,920][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:03:48,242][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:03:48,563][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:03:48,884][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:03:49,205][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:03:49,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:03:49,848][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:03:50,170][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:03:50,491][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:03:50,812][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:03:51,133][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:03:51,456][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:03:51,779][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:03:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:03:52,420][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:03:52,741][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:03:53,062][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:03:53,384][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:03:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:03:54,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:03:54,348][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:03:54,670][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:03:54,991][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:03:55,311][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:03:55,633][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:03:55,954][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:03:56,276][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:03:56,597][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:03:56,919][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:03:57,241][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:03:57,562][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:03:57,884][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:03:58,206][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:03:58,528][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:03:58,849][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:03:59,171][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:03:59,493][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:03:59,813][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:04:00,134][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:04:00,454][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:04:01,070][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:04:01,392][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:04:01,714][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:04:02,036][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:04:02,357][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:04:02,677][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:04:02,998][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:04:03,319][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:04:03,640][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:04:03,961][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:04:04,283][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:04:04,604][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:04:04,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:04:05,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:04:06,331][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:04:06,333][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:04:06,335][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:04:06,983][__main__][INFO] - Iteration 57 took 27s (11.84% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 6m 51s. Estimated total time: 7h 35m 54s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 35s, 500 more iterations: 3h 47m 57s. [2026-03-25 16:04:06,985][__main__][INFO] - Starting iteration 57. [2026-03-25 16:04:06,988][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:04:06,989][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:04:10,216][__main__][INFO] - Number of regex retries in iteration 57: 0 [2026-03-25 16:04:10,217][__main__][INFO] - agents played in iteration 57 are Alice, Bob [2026-03-25 16:04:10,766][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:04:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:04:11,723][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:04:12,045][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:04:12,367][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:04:12,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:04:13,011][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:04:13,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:04:13,654][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:04:13,975][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:04:14,296][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:04:14,617][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:04:14,938][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:04:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:04:15,579][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:04:15,900][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:04:16,222][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:04:16,543][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:04:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:04:17,186][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:04:17,507][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:04:17,828][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:04:18,150][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:04:18,472][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:04:18,794][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:04:19,115][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:04:19,437][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:04:19,758][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:04:20,080][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:04:20,402][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:04:20,723][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:04:21,045][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:04:21,366][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:04:21,686][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:04:22,008][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:04:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:04:22,651][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:04:22,972][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:04:23,294][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:04:23,616][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:04:23,938][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:04:24,260][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:04:24,581][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:04:24,903][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:04:25,224][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:04:25,546][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:04:25,867][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:04:26,188][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:04:26,510][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:04:26,831][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:04:27,153][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:04:27,474][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:04:27,795][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:04:28,416][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:04:28,738][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:04:29,059][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:04:29,381][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:04:29,702][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:04:30,024][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:04:30,345][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:04:30,666][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:04:30,988][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:04:31,308][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:04:31,630][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:04:31,952][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:04:32,272][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:04:32,930][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:04:33,653][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:04:33,655][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:04:33,657][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:04:34,291][__main__][INFO] - Iteration 58 took 27s (11.82% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 5m 34s. Estimated total time: 7h 35m 3s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 30s, 500 more iterations: 3h 47m 31s. [2026-03-25 16:04:34,293][__main__][INFO] - Starting iteration 58. [2026-03-25 16:04:34,296][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:04:34,297][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:04:37,538][__main__][INFO] - Number of regex retries in iteration 58: 0 [2026-03-25 16:04:37,539][__main__][INFO] - agents played in iteration 58 are Alice, Bob [2026-03-25 16:04:38,112][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:04:38,776][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:04:39,068][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:04:39,390][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:04:39,711][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:04:40,032][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:04:40,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:04:40,675][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:04:40,997][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:04:41,318][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:04:41,639][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:04:41,961][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:04:42,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:04:42,602][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:04:42,923][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:04:43,246][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:04:43,568][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:04:43,889][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:04:44,211][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:04:44,532][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:04:44,853][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:04:45,174][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:04:45,495][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:04:45,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:04:46,139][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:04:46,460][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:04:46,781][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:04:47,103][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:04:47,425][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:04:47,746][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:04:48,066][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:04:48,388][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:04:48,710][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:04:49,031][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:04:49,353][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:04:49,674][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:04:49,997][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:04:50,318][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:04:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:04:50,964][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:04:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:04:51,607][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:04:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:04:52,251][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:04:52,572][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:04:52,892][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:04:53,215][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:04:53,536][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:04:53,857][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:04:54,179][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:04:54,500][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:04:54,821][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:04:55,144][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:04:55,763][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:04:56,084][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:04:56,405][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:04:56,726][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:04:57,049][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:04:57,370][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:04:57,691][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:04:58,013][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:04:58,333][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:04:58,654][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:04:58,976][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:04:59,297][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:04:59,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:05:00,277][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:05:01,010][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:05:01,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:05:01,014][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:05:01,658][__main__][INFO] - Iteration 59 took 27s (11.85% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 6m 5s. Estimated total time: 7h 36m 2s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 36s, 500 more iterations: 3h 48m 1s. [2026-03-25 16:05:01,660][__main__][INFO] - Starting iteration 59. [2026-03-25 16:05:01,663][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:05:01,664][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:05:04,893][__main__][INFO] - Number of regex retries in iteration 59: 0 [2026-03-25 16:05:04,894][__main__][INFO] - agents played in iteration 59 are Alice, Bob [2026-03-25 16:05:05,449][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:05:06,117][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:05:06,409][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:05:06,730][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:05:07,051][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:05:07,372][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:05:07,694][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:05:08,015][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:05:08,336][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:05:08,656][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:05:08,978][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:05:09,300][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:05:09,621][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:05:09,943][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:05:10,264][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:05:10,585][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:05:10,906][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:05:11,227][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:05:11,549][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:05:11,870][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:05:12,191][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:05:12,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:05:12,833][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:05:13,155][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:05:13,477][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:05:13,798][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:05:14,119][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:05:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:05:14,761][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:05:15,083][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:05:15,403][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:05:15,725][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:05:16,046][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:05:16,367][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:05:16,688][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:05:17,010][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:05:17,332][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:05:17,653][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:05:17,974][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:05:18,295][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:05:18,617][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:05:18,938][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:05:19,260][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:05:19,581][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:05:19,903][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:05:20,224][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:05:20,545][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:05:20,866][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:05:21,187][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:05:21,509][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:05:21,830][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:05:22,152][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:05:22,472][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:05:23,094][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:05:23,416][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:05:23,737][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:05:24,059][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:05:24,380][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:05:24,702][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:05:25,023][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:05:25,345][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:05:25,666][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:05:25,988][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:05:26,309][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:05:26,631][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:05:26,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:05:27,610][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:05:28,388][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:05:28,390][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:05:28,392][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:05:29,041][__main__][INFO] - Iteration 60 took 27s (11.80% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 5m 54s. Estimated total time: 7h 36m 18s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 9s. [2026-03-25 16:05:29,043][__main__][INFO] - Starting iteration 60. [2026-03-25 16:05:29,046][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:05:29,047][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:05:32,261][__main__][INFO] - Number of regex retries in iteration 60: 0 [2026-03-25 16:05:32,262][__main__][INFO] - agents played in iteration 60 are Alice, Bob [2026-03-25 16:05:32,836][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:05:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:05:33,792][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:05:34,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:05:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:05:34,756][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:05:35,077][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:05:35,398][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:05:35,719][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:05:36,042][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:05:36,364][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:05:36,685][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:05:37,007][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:05:37,328][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:05:37,650][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:05:37,972][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:05:38,293][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:05:38,614][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:05:38,934][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:05:39,255][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:05:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:05:39,898][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:05:40,219][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:05:40,540][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:05:40,860][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:05:41,181][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:05:41,501][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:05:41,822][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:05:42,144][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:05:42,465][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:05:42,786][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:05:43,107][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:05:43,428][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:05:43,749][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:05:44,071][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:05:44,392][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:05:44,714][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:05:45,036][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:05:45,357][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:05:45,678][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:05:45,999][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:05:46,321][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:05:46,642][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:05:46,965][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:05:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:05:47,610][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:05:47,931][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:05:48,253][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:05:48,574][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:05:48,896][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:05:49,217][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:05:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:05:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:05:50,479][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:05:50,801][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:05:51,122][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:05:51,444][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:05:51,765][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:05:52,087][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:05:52,409][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:05:52,730][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:05:53,050][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:05:53,372][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:05:53,693][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:05:54,015][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:05:54,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:05:54,995][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:05:55,751][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:05:55,753][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:05:55,755][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:05:56,409][__main__][INFO] - Iteration 61 took 27s (11.75% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 5m 11s. Estimated total time: 7h 36m 3s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 36s, 500 more iterations: 3h 48m 1s. [2026-03-25 16:05:56,411][__main__][INFO] - Starting iteration 61. [2026-03-25 16:05:56,415][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:05:56,415][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:05:59,661][__main__][INFO] - Number of regex retries in iteration 61: 0 [2026-03-25 16:05:59,662][__main__][INFO] - agents played in iteration 61 are Alice, Bob [2026-03-25 16:06:00,240][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:06:00,903][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:06:01,195][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:06:01,516][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:06:01,837][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:06:02,158][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:06:02,479][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:06:02,800][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:06:03,121][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:06:03,442][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:06:03,764][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:06:04,086][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:06:04,407][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:06:04,727][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:06:05,048][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:06:05,371][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:06:05,694][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:06:06,015][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:06:06,337][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:06:06,657][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:06:06,978][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:06:07,299][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:06:07,621][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:06:07,944][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:06:08,267][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:06:08,588][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:06:08,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:06:09,233][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:06:09,555][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:06:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:06:10,199][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:06:10,521][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:06:10,844][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:06:11,167][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:06:11,489][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:06:11,812][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:06:12,134][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:06:12,455][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:06:12,776][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:06:13,097][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:06:13,419][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:06:13,741][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:06:14,063][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:06:14,385][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:06:14,706][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:06:15,028][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:06:15,350][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:06:15,673][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:06:15,995][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:06:16,316][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:06:16,639][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:06:16,961][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:06:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:06:17,905][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:06:18,228][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:06:18,550][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:06:18,872][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:06:19,194][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:06:19,515][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:06:19,837][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:06:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:06:20,481][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:06:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:06:21,124][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:06:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:06:21,767][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:06:22,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:06:23,173][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:06:23,175][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:06:23,177][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:06:23,820][__main__][INFO] - Iteration 62 took 27s (11.85% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 5m 26s. Estimated total time: 7h 36m 46s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 23s. [2026-03-25 16:06:23,822][__main__][INFO] - Starting iteration 62. [2026-03-25 16:06:23,825][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:06:23,825][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:06:27,032][__main__][INFO] - Number of regex retries in iteration 62: 0 [2026-03-25 16:06:27,033][__main__][INFO] - agents played in iteration 62 are Alice, Bob [2026-03-25 16:06:27,608][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:06:28,273][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:06:28,564][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:06:28,886][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:06:29,207][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:06:29,528][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:06:29,849][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:06:30,171][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:06:30,492][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:06:30,814][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:06:31,136][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:06:31,457][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:06:31,780][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:06:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:06:32,423][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:06:32,745][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:06:33,066][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:06:33,387][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:06:33,709][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:06:34,030][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:06:34,352][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:06:34,673][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:06:34,995][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:06:35,316][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:06:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:06:35,961][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:06:36,283][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:06:36,605][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:06:36,926][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:06:37,248][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:06:37,571][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:06:37,894][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:06:38,216][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:06:38,537][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:06:38,859][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:06:39,181][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:06:39,502][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:06:39,824][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:06:40,145][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:06:40,467][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:06:40,790][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:06:41,111][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:06:41,433][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:06:41,755][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:06:42,077][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:06:42,399][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:06:42,720][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:06:43,042][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:06:43,364][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:06:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:06:44,007][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:06:44,328][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:06:44,649][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:06:45,267][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:06:45,588][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:06:45,909][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:06:46,231][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:06:46,552][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:06:46,874][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:06:47,196][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:06:47,518][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:06:47,839][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:06:48,160][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:06:48,482][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:06:48,805][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:06:49,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:06:49,785][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:06:50,525][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:06:50,527][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:06:50,529][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:06:51,169][__main__][INFO] - Iteration 63 took 27s (11.73% Gen, 85.93% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 3m 58s. Estimated total time: 7h 35m 44s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 34s, 500 more iterations: 3h 47m 52s. [2026-03-25 16:06:51,171][__main__][INFO] - Starting iteration 63. [2026-03-25 16:06:51,174][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:06:51,175][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:06:54,373][__main__][INFO] - Number of regex retries in iteration 63: 0 [2026-03-25 16:06:54,373][__main__][INFO] - agents played in iteration 63 are Alice, Bob [2026-03-25 16:06:54,958][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:06:55,623][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:06:55,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:06:56,237][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:06:56,558][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:06:56,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:06:57,201][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:06:57,523][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:06:57,844][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:06:58,167][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:06:58,489][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:06:58,811][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:06:59,133][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:06:59,454][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:06:59,775][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:07:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:07:00,419][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:07:00,740][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:07:01,061][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:07:01,383][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:07:01,705][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:07:02,027][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:07:02,349][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:07:02,669][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:07:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:07:03,311][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:07:03,632][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:07:03,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:07:04,274][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:07:04,594][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:07:04,916][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:07:05,237][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:07:05,558][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:07:05,880][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:07:06,201][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:07:06,524][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:07:06,845][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:07:07,167][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:07:07,489][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:07:07,810][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:07:08,133][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:07:08,455][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:07:08,777][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:07:09,099][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:07:09,421][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:07:09,743][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:07:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:07:10,387][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:07:10,709][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:07:11,031][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:07:11,352][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:07:11,674][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:07:11,996][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:07:12,615][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:07:12,936][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:07:13,258][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:07:13,579][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:07:13,902][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:07:14,224][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:07:14,546][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:07:14,867][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:07:15,188][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:07:15,510][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:07:15,831][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:07:16,152][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:07:16,474][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:07:17,132][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:07:17,871][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:07:17,873][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:07:17,875][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:07:18,511][__main__][INFO] - Iteration 64 took 27s (11.70% Gen, 85.96% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 3m 24s. Estimated total time: 7h 35m 38s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 49s. [2026-03-25 16:07:18,513][__main__][INFO] - Starting iteration 64. [2026-03-25 16:07:18,516][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:07:18,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:07:21,747][__main__][INFO] - Number of regex retries in iteration 64: 0 [2026-03-25 16:07:21,748][__main__][INFO] - agents played in iteration 64 are Alice, Bob [2026-03-25 16:07:22,353][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:07:23,018][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:07:23,309][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:07:23,632][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:07:23,953][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:07:24,274][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:07:24,595][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:07:24,917][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:07:25,239][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:07:25,561][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:07:25,882][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:07:26,204][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:07:26,526][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:07:26,849][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:07:27,171][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:07:27,492][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:07:27,814][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:07:28,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:07:28,457][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:07:28,778][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:07:29,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:07:29,422][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:07:29,744][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:07:30,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:07:30,390][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:07:30,712][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:07:31,034][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:07:31,357][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:07:31,680][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:07:32,002][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:07:32,324][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:07:32,646][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:07:32,967][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:07:33,289][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:07:33,610][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:07:33,931][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:07:34,253][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:07:34,574][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:07:34,896][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:07:35,218][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:07:35,540][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:07:35,861][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:07:36,183][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:07:36,504][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:07:36,826][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:07:37,147][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:07:37,469][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:07:37,790][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:07:38,112][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:07:38,434][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:07:38,755][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:07:39,078][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:07:39,399][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:07:40,018][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:07:40,340][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:07:40,661][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:07:40,983][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:07:41,305][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:07:41,627][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:07:41,950][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:07:42,272][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:07:42,594][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:07:42,917][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:07:43,239][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:07:43,561][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:07:43,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:07:44,547][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:07:45,309][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:07:45,312][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:07:45,313][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:07:45,952][__main__][INFO] - Iteration 65 took 27s (11.77% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 4m 35s. Estimated total time: 7h 37m 16s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 43s, 500 more iterations: 3h 48m 38s. [2026-03-25 16:07:45,954][__main__][INFO] - Starting iteration 65. [2026-03-25 16:07:45,957][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:07:45,958][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:07:49,166][__main__][INFO] - Number of regex retries in iteration 65: 0 [2026-03-25 16:07:49,167][__main__][INFO] - agents played in iteration 65 are Alice, Bob [2026-03-25 16:07:49,760][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:07:50,426][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:07:50,718][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:07:51,040][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:07:51,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:07:51,682][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:07:52,003][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:07:52,324][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:07:52,646][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:07:52,968][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:07:53,290][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:07:53,611][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:07:53,934][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:07:54,255][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:07:54,577][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:07:54,899][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:07:55,221][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:07:55,543][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:07:55,865][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:07:56,187][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:07:56,508][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:07:56,829][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:07:57,151][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:07:57,472][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:07:57,793][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:07:58,114][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:07:58,436][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:07:58,758][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:07:59,079][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:07:59,400][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:07:59,722][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:08:00,043][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:08:00,366][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:08:00,687][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:08:01,009][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:08:01,331][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:08:01,652][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:08:01,974][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:08:02,295][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:08:02,617][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:08:02,939][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:08:03,260][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:08:03,582][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:08:03,903][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:08:04,225][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:08:04,546][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:08:04,868][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:08:05,190][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:08:05,512][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:08:05,834][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:08:06,157][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:08:06,478][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:08:06,800][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:08:07,419][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:08:07,741][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:08:08,063][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:08:08,385][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:08:08,708][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:08:09,030][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:08:09,351][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:08:09,673][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:08:09,994][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:08:10,317][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:08:10,638][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:08:10,960][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:08:11,282][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:08:11,942][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:08:12,683][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:08:12,685][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:08:12,687][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:08:13,327][__main__][INFO] - Iteration 66 took 27s (11.72% Gen, 85.93% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 3m 2s. Estimated total time: 7h 36m 11s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 5s. [2026-03-25 16:08:13,330][__main__][INFO] - Starting iteration 66. [2026-03-25 16:08:13,333][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:08:13,334][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:08:16,549][__main__][INFO] - Number of regex retries in iteration 66: 0 [2026-03-25 16:08:16,550][__main__][INFO] - agents played in iteration 66 are Alice, Bob [2026-03-25 16:08:17,132][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:08:17,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:08:18,093][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:08:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:08:18,736][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:08:19,057][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:08:19,379][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:08:19,700][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:08:20,022][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:08:20,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:08:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:08:20,988][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:08:21,310][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:08:21,632][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:08:21,954][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:08:22,275][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:08:22,597][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:08:22,919][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:08:23,240][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:08:23,562][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:08:23,883][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:08:24,205][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:08:24,527][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:08:24,849][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:08:25,171][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:08:25,493][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:08:25,815][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:08:26,136][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:08:26,458][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:08:26,779][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:08:27,100][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:08:27,421][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:08:27,743][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:08:28,065][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:08:28,386][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:08:28,708][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:08:29,030][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:08:29,351][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:08:29,673][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:08:29,995][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:08:30,317][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:08:30,639][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:08:30,961][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:08:31,282][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:08:31,603][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:08:31,926][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:08:32,248][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:08:32,569][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:08:32,891][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:08:33,213][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:08:33,534][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:08:33,855][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:08:34,176][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:08:34,797][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:08:35,119][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:08:35,440][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:08:35,761][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:08:36,083][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:08:36,405][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:08:36,727][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:08:37,048][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:08:37,370][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:08:37,692][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:08:38,014][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:08:38,336][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:08:38,657][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:08:39,321][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:08:40,073][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:08:40,075][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:08:40,078][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:08:40,740][__main__][INFO] - Iteration 67 took 27s (11.73% Gen, 85.84% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 3m 12s. Estimated total time: 7h 36m 48s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 24s. [2026-03-25 16:08:40,743][__main__][INFO] - Starting iteration 67. [2026-03-25 16:08:40,746][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:08:40,746][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:08:43,956][__main__][INFO] - Number of regex retries in iteration 67: 0 [2026-03-25 16:08:43,957][__main__][INFO] - agents played in iteration 67 are Alice, Bob [2026-03-25 16:08:44,541][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:08:45,207][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:08:45,498][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:08:45,820][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:08:46,141][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:08:46,462][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:08:46,784][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:08:47,105][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:08:47,427][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:08:47,748][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:08:48,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:08:48,391][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:08:48,713][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:08:49,035][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:08:49,355][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:08:49,677][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:08:49,999][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:08:50,322][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:08:50,644][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:08:50,966][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:08:51,288][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:08:51,610][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:08:51,931][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:08:52,253][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:08:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:08:52,898][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:08:53,220][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:08:53,541][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:08:53,864][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:08:54,186][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:08:54,508][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:08:54,831][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:08:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:08:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:08:55,796][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:08:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:08:56,440][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:08:56,762][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:08:57,084][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:08:57,407][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:08:57,729][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:08:58,052][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:08:58,374][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:08:58,696][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:08:59,017][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:08:59,339][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:08:59,661][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:08:59,983][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:09:00,305][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:09:00,627][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:09:00,950][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:09:01,272][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:09:01,593][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:09:02,213][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:09:02,535][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:09:02,858][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:09:03,179][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:09:03,502][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:09:03,824][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:09:04,146][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:09:04,468][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:09:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:09:05,112][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:09:05,433][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:09:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:09:06,077][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:09:06,739][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:09:07,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:09:07,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:09:07,488][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:09:08,140][__main__][INFO] - Iteration 68 took 27s (11.72% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 2m 31s. Estimated total time: 7h 36m 35s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 17s. [2026-03-25 16:09:08,142][__main__][INFO] - Starting iteration 68. [2026-03-25 16:09:08,145][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:09:08,146][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:09:11,396][__main__][INFO] - Number of regex retries in iteration 68: 0 [2026-03-25 16:09:11,397][__main__][INFO] - agents played in iteration 68 are Alice, Bob [2026-03-25 16:09:11,976][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:09:12,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:09:12,932][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:09:13,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:09:13,574][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:09:13,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:09:14,216][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:09:14,538][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:09:14,860][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:09:15,180][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:09:15,501][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:09:15,823][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:09:16,144][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:09:16,466][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:09:16,787][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:09:17,109][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:09:17,430][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:09:17,751][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:09:18,073][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:09:18,395][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:09:18,717][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:09:19,038][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:09:19,359][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:09:19,680][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:09:20,002][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:09:20,324][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:09:20,646][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:09:20,967][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:09:21,289][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:09:21,610][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:09:21,932][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:09:22,253][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:09:22,575][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:09:22,897][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:09:23,218][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:09:23,539][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:09:23,861][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:09:24,183][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:09:24,504][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:09:24,826][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:09:25,147][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:09:25,468][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:09:25,790][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:09:26,112][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:09:26,434][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:09:26,755][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:09:27,077][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:09:27,400][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:09:27,722][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:09:28,044][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:09:28,366][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:09:28,688][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:09:29,010][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:09:29,626][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:09:29,948][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:09:30,269][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:09:30,591][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:09:30,913][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:09:31,234][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:09:31,555][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:09:31,877][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:09:32,198][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:09:32,520][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:09:32,842][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:09:33,165][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:09:33,487][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:09:34,142][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:09:34,891][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:09:34,893][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:09:34,895][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:09:35,542][__main__][INFO] - Iteration 69 took 27s (11.87% Gen, 85.76% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 2m 6s. Estimated total time: 7h 36m 37s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 18s. [2026-03-25 16:09:35,544][__main__][INFO] - Starting iteration 69. [2026-03-25 16:09:35,548][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:09:35,548][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:09:38,765][__main__][INFO] - Number of regex retries in iteration 69: 0 [2026-03-25 16:09:38,766][__main__][INFO] - agents played in iteration 69 are Alice, Bob [2026-03-25 16:09:39,348][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:09:40,015][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:09:40,306][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:09:40,627][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:09:40,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:09:41,270][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:09:41,592][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:09:41,913][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:09:42,235][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:09:42,556][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:09:42,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:09:43,200][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:09:43,522][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:09:43,844][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:09:44,165][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:09:44,487][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:09:44,809][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:09:45,131][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:09:45,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:09:45,774][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:09:46,095][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:09:46,418][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:09:46,740][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:09:47,061][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:09:47,383][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:09:47,704][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:09:48,026][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:09:48,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:09:48,670][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:09:48,992][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:09:49,313][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:09:49,634][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:09:49,956][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:09:50,278][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:09:50,600][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:09:50,922][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:09:51,243][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:09:51,565][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:09:51,886][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:09:52,207][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:09:52,529][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:09:52,851][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:09:53,174][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:09:53,495][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:09:53,816][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:09:54,137][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:09:54,458][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:09:54,780][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:09:55,102][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:09:55,424][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:09:55,746][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:09:56,067][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:09:56,388][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:09:57,008][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:09:57,331][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:09:57,652][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:09:57,973][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:09:58,295][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:09:58,616][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:09:58,938][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:09:59,260][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:09:59,582][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:09:59,904][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:10:00,226][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:10:00,548][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:10:00,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:10:01,531][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:10:02,290][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:10:02,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:10:02,294][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:10:02,950][__main__][INFO] - Iteration 70 took 27s (11.74% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 1m 44s. Estimated total time: 7h 36m 43s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 21s. [2026-03-25 16:10:02,959][__main__][INFO] - Starting iteration 70. [2026-03-25 16:10:02,962][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:10:02,962][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:10:06,168][__main__][INFO] - Number of regex retries in iteration 70: 0 [2026-03-25 16:10:06,169][__main__][INFO] - agents played in iteration 70 are Alice, Bob [2026-03-25 16:10:06,750][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:10:07,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:10:07,707][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:10:08,028][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:10:08,350][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:10:08,671][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:10:08,994][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:10:09,316][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:10:09,637][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:10:09,959][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:10:10,280][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:10:10,603][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:10:10,925][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:10:11,248][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:10:11,571][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:10:11,894][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:10:12,216][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:10:12,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:10:12,860][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:10:13,181][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:10:13,503][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:10:13,825][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:10:14,147][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:10:14,469][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:10:14,790][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:10:15,113][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:10:15,435][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:10:15,756][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:10:16,078][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:10:16,400][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:10:16,722][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:10:17,044][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:10:17,367][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:10:17,689][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:10:18,010][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:10:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:10:18,653][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:10:18,975][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:10:19,297][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:10:19,619][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:10:19,941][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:10:20,263][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:10:20,585][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:10:20,908][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:10:21,228][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:10:21,550][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:10:21,873][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:10:22,194][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:10:22,516][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:10:22,838][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:10:23,160][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:10:23,482][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:10:23,804][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:10:24,425][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:10:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:10:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:10:25,390][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:10:25,712][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:10:26,034][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:10:26,355][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:10:26,676][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:10:26,997][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:10:27,319][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:10:27,640][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:10:27,962][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:10:28,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:10:28,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:10:29,696][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:10:29,698][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:10:29,700][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:10:30,361][__main__][INFO] - Iteration 71 took 27s (11.70% Gen, 85.88% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 1m 14s. Estimated total time: 7h 36m 40s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 20s. [2026-03-25 16:10:30,363][__main__][INFO] - Starting iteration 71. [2026-03-25 16:10:30,366][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:10:30,367][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:10:33,584][__main__][INFO] - Number of regex retries in iteration 71: 0 [2026-03-25 16:10:33,585][__main__][INFO] - agents played in iteration 71 are Alice, Bob [2026-03-25 16:10:34,161][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:10:34,828][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:10:35,120][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:10:35,442][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:10:35,764][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:10:36,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:10:36,407][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:10:36,729][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:10:37,051][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:10:37,373][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:10:37,696][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:10:38,018][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:10:38,340][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:10:38,662][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:10:38,984][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:10:39,306][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:10:39,627][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:10:39,948][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:10:40,270][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:10:40,592][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:10:40,913][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:10:41,234][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:10:41,555][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:10:41,877][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:10:42,198][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:10:42,520][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:10:42,840][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:10:43,161][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:10:43,483][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:10:43,804][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:10:44,126][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:10:44,447][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:10:44,768][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:10:45,090][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:10:45,412][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:10:45,735][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:10:46,056][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:10:46,378][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:10:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:10:47,022][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:10:47,344][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:10:47,667][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:10:47,989][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:10:48,311][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:10:48,632][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:10:48,953][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:10:49,275][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:10:49,596][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:10:49,918][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:10:50,239][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:10:50,560][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:10:50,882][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:10:51,204][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:10:51,824][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:10:52,146][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:10:52,467][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:10:52,788][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:10:53,108][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:10:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:10:53,752][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:10:54,073][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:10:54,394][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:10:54,716][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:10:55,037][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:10:55,360][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:10:55,681][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:10:56,343][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:10:57,078][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:10:57,081][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:10:57,082][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:10:57,736][__main__][INFO] - Iteration 72 took 27s (11.76% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 0m 17s. Estimated total time: 7h 36m 11s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 5s. [2026-03-25 16:10:57,739][__main__][INFO] - Starting iteration 72. [2026-03-25 16:10:57,742][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:10:57,742][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:11:00,966][__main__][INFO] - Number of regex retries in iteration 72: 0 [2026-03-25 16:11:00,967][__main__][INFO] - agents played in iteration 72 are Alice, Bob [2026-03-25 16:11:01,556][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:11:02,224][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:11:02,516][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:11:02,838][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:11:03,160][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:11:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:11:03,802][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:11:04,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:11:04,446][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:11:04,767][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:11:05,088][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:11:05,411][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:11:05,732][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:11:06,053][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:11:06,374][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:11:06,695][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:11:07,016][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:11:07,337][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:11:07,660][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:11:07,982][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:11:08,303][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:11:08,625][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:11:08,947][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:11:09,270][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:11:09,592][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:11:09,914][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:11:10,235][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:11:10,557][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:11:10,878][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:11:11,200][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:11:11,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:11:11,842][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:11:12,164][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:11:12,485][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:11:12,806][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:11:13,127][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:11:13,449][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:11:13,770][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:11:14,092][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:11:14,415][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:11:14,736][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:11:15,057][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:11:15,379][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:11:15,701][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:11:16,022][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:11:16,344][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:11:16,666][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:11:16,987][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:11:17,309][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:11:17,631][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:11:17,952][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:11:18,274][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:11:18,595][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:11:19,215][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:11:19,536][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:11:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:11:20,180][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:11:20,501][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:11:20,822][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:11:21,144][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:11:21,466][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:11:21,787][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:11:22,107][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:11:22,428][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:11:22,750][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:11:23,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:11:23,732][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:11:24,474][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:11:24,476][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:11:24,478][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:11:25,147][__main__][INFO] - Iteration 73 took 27s (11.77% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 0m 25s. Estimated total time: 7h 36m 46s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 23s. [2026-03-25 16:11:25,149][__main__][INFO] - Starting iteration 73. [2026-03-25 16:11:25,153][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:11:25,154][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:11:28,377][__main__][INFO] - Number of regex retries in iteration 73: 0 [2026-03-25 16:11:28,378][__main__][INFO] - agents played in iteration 73 are Alice, Bob [2026-03-25 16:11:28,967][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:11:29,636][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:11:29,928][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:11:30,251][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:11:30,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:11:30,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:11:31,217][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:11:31,539][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:11:31,861][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:11:32,182][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:11:32,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:11:32,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:11:33,147][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:11:33,468][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:11:33,790][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:11:34,111][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:11:34,433][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:11:34,755][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:11:35,077][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:11:35,399][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:11:35,720][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:11:36,041][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:11:36,363][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:11:36,684][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:11:37,006][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:11:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:11:37,648][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:11:37,969][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:11:38,291][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:11:38,613][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:11:38,936][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:11:39,257][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:11:39,579][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:11:39,900][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:11:40,221][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:11:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:11:40,863][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:11:41,184][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:11:41,507][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:11:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:11:42,153][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:11:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:11:42,796][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:11:43,118][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:11:43,440][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:11:43,762][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:11:44,083][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:11:44,405][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:11:44,728][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:11:45,050][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:11:45,372][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:11:45,694][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:11:46,016][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:11:46,636][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:11:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:11:47,280][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:11:47,601][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:11:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:11:48,245][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:11:48,566][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:11:48,887][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:11:49,208][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:11:49,530][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:11:49,851][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:11:50,172][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:11:50,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:11:51,153][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:11:51,893][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:11:51,895][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:11:51,897][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:11:52,550][__main__][INFO] - Iteration 74 took 27s (11.77% Gen, 85.84% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 59m 50s. Estimated total time: 7h 36m 38s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 19s. [2026-03-25 16:11:52,552][__main__][INFO] - Starting iteration 74. [2026-03-25 16:11:52,556][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:11:52,556][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:11:55,773][__main__][INFO] - Number of regex retries in iteration 74: 0 [2026-03-25 16:11:55,774][__main__][INFO] - agents played in iteration 74 are Alice, Bob [2026-03-25 16:11:56,367][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:11:57,039][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:11:57,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:11:57,652][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:11:57,974][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:11:58,296][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:11:58,619][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:11:58,940][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:11:59,262][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:11:59,584][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:11:59,905][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:12:00,226][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:12:00,548][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:12:00,869][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:12:01,190][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:12:01,511][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:12:01,832][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:12:02,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:12:02,474][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:12:02,795][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:12:03,118][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:12:03,439][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:12:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:12:04,082][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:12:04,404][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:12:04,725][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:12:05,045][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:12:05,366][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:12:05,688][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:12:06,010][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:12:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:12:06,652][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:12:06,973][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:12:07,295][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:12:07,615][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:12:07,936][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:12:08,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:12:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:12:08,898][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:12:09,219][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:12:09,540][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:12:09,862][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:12:10,182][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:12:10,503][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:12:10,825][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:12:11,148][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:12:11,469][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:12:11,791][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:12:12,112][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:12:12,432][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:12:12,753][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:12:13,074][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:12:13,395][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:12:14,014][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:12:14,337][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:12:14,657][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:12:14,980][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:12:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:12:15,623][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:12:15,945][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:12:16,267][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:12:16,588][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:12:16,909][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:12:17,230][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:12:17,551][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:12:17,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:12:18,532][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:12:19,347][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:12:19,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:12:19,351][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:12:20,008][__main__][INFO] - Iteration 75 took 27s (11.72% Gen, 85.88% Train). Generation: 3s, Training: 23s. Estimated remaining time: 7h 0m 17s. Estimated total time: 7h 37m 33s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 45s, 500 more iterations: 3h 48m 46s. [2026-03-25 16:12:20,010][__main__][INFO] - Starting iteration 75. [2026-03-25 16:12:20,013][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:12:20,014][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:12:23,264][__main__][INFO] - Number of regex retries in iteration 75: 0 [2026-03-25 16:12:23,264][__main__][INFO] - agents played in iteration 75 are Alice, Bob [2026-03-25 16:12:23,867][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:12:24,534][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:12:24,826][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:12:25,148][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:12:25,470][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:12:25,791][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:12:26,113][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:12:26,435][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:12:26,756][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:12:27,078][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:12:27,399][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:12:27,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:12:28,042][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:12:28,363][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:12:28,685][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:12:29,007][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:12:29,328][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:12:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:12:29,970][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:12:30,291][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:12:30,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:12:30,933][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:12:31,254][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:12:31,575][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:12:31,897][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:12:32,218][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:12:32,540][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:12:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:12:33,183][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:12:33,503][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:12:33,824][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:12:34,145][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:12:34,466][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:12:34,788][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:12:35,109][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:12:35,429][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:12:35,750][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:12:36,073][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:12:36,394][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:12:36,715][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:12:37,036][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:12:37,357][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:12:37,678][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:12:38,000][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:12:38,322][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:12:38,643][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:12:38,964][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:12:39,287][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:12:39,608][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:12:39,929][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:12:40,251][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:12:40,572][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:12:40,893][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:12:41,514][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:12:41,836][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:12:42,158][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:12:42,478][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:12:42,799][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:12:43,119][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:12:43,442][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:12:43,763][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:12:44,084][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:12:44,405][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:12:44,727][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:12:45,049][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:12:45,370][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:12:46,032][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:12:46,774][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:12:46,776][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:12:46,778][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:12:47,429][__main__][INFO] - Iteration 76 took 27s (11.86% Gen, 85.76% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 59m 13s. Estimated total time: 7h 36m 56s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 28s. [2026-03-25 16:12:47,431][__main__][INFO] - Starting iteration 76. [2026-03-25 16:12:47,434][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:12:47,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:12:50,655][__main__][INFO] - Number of regex retries in iteration 76: 0 [2026-03-25 16:12:50,656][__main__][INFO] - agents played in iteration 76 are Alice, Bob [2026-03-25 16:12:51,251][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:12:51,917][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:12:52,209][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:12:52,532][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:12:52,854][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:12:53,176][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:12:53,497][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:12:53,819][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:12:54,142][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:12:54,466][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:12:54,789][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:12:55,111][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:12:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:12:55,753][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:12:56,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:12:56,397][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:12:56,718][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:12:57,040][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:12:57,361][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:12:57,682][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:12:58,004][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:12:58,326][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:12:58,648][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:12:58,970][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:12:59,292][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:12:59,614][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:12:59,935][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:13:00,256][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:13:00,578][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:13:00,900][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:13:01,221][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:13:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:13:01,863][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:13:02,185][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:13:02,506][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:13:02,828][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:13:03,150][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:13:03,472][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:13:03,794][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:13:04,116][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:13:04,438][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:13:04,759][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:13:05,080][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:13:05,401][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:13:05,723][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:13:06,046][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:13:06,367][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:13:06,690][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:13:07,011][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:13:07,333][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:13:07,653][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:13:07,974][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:13:08,295][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:13:08,914][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:13:09,236][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:13:09,558][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:13:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:13:10,202][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:13:10,524][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:13:10,845][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:13:11,167][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:13:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:13:11,809][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:13:12,130][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:13:12,450][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:13:12,772][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:13:13,433][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:13:14,218][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:13:14,220][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:13:14,222][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:13:14,876][__main__][INFO] - Iteration 77 took 27s (11.74% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 59m 12s. Estimated total time: 7h 37m 23s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 41s. [2026-03-25 16:13:14,878][__main__][INFO] - Starting iteration 77. [2026-03-25 16:13:14,882][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:13:14,882][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:13:18,109][__main__][INFO] - Number of regex retries in iteration 77: 0 [2026-03-25 16:13:18,110][__main__][INFO] - agents played in iteration 77 are Alice, Bob [2026-03-25 16:13:18,700][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:13:19,368][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:13:19,660][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:13:19,982][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:13:20,303][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:13:20,627][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:13:20,948][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:13:21,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:13:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:13:21,911][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:13:22,232][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:13:22,554][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:13:22,874][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:13:23,195][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:13:23,517][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:13:23,838][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:13:24,159][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:13:24,481][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:13:24,803][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:13:25,126][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:13:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:13:25,769][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:13:26,091][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:13:26,413][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:13:26,735][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:13:27,058][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:13:27,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:13:27,702][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:13:28,023][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:13:28,344][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:13:28,665][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:13:28,986][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:13:29,307][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:13:29,629][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:13:29,950][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:13:30,272][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:13:30,593][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:13:30,915][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:13:31,236][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:13:31,557][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:13:31,878][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:13:32,200][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:13:32,522][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:13:32,843][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:13:33,165][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:13:33,486][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:13:33,807][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:13:34,128][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:13:34,450][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:13:34,772][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:13:35,094][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:13:35,415][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:13:35,738][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:13:36,360][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:13:36,682][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:13:37,004][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:13:37,325][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:13:37,647][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:13:37,969][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:13:38,291][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:13:38,612][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:13:38,933][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:13:39,255][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:13:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:13:39,899][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:13:40,220][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:13:40,881][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:13:41,618][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:13:41,620][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:13:41,622][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:13:42,262][__main__][INFO] - Iteration 78 took 27s (11.79% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 57m 43s. Estimated total time: 7h 36m 21s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 10s. [2026-03-25 16:13:42,264][__main__][INFO] - Starting iteration 78. [2026-03-25 16:13:42,268][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:13:42,268][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:13:45,507][__main__][INFO] - Number of regex retries in iteration 78: 0 [2026-03-25 16:13:45,508][__main__][INFO] - agents played in iteration 78 are Alice, Bob [2026-03-25 16:13:46,102][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:13:46,765][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:13:47,057][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:13:47,378][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:13:47,701][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:13:48,022][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:13:48,343][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:13:48,665][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:13:48,986][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:13:49,306][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:13:49,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:13:49,950][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:13:50,272][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:13:50,593][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:13:50,913][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:13:51,235][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:13:51,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:13:51,878][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:13:52,199][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:13:52,521][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:13:52,843][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:13:53,164][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:13:53,486][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:13:53,808][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:13:54,131][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:13:54,453][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:13:54,773][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:13:55,094][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:13:55,415][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:13:55,736][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:13:56,058][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:13:56,380][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:13:56,702][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:13:57,023][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:13:57,346][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:13:57,667][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:13:57,989][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:13:58,311][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:13:58,632][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:13:58,954][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:13:59,275][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:13:59,597][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:13:59,918][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:14:00,239][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:14:00,560][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:14:00,881][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:14:01,202][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:14:01,525][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:14:01,847][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:14:02,169][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:14:02,489][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:14:02,811][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:14:03,132][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:14:03,749][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:14:04,071][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:14:04,393][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:14:04,715][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:14:05,036][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:14:05,357][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:14:05,678][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:14:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:14:06,322][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:14:06,645][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:14:06,966][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:14:07,287][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:14:07,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:14:08,266][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:14:09,010][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:14:09,013][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:14:09,014][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:14:09,659][__main__][INFO] - Iteration 79 took 27s (11.83% Gen, 85.81% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 57m 27s. Estimated total time: 7h 36m 32s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 16s. [2026-03-25 16:14:09,661][__main__][INFO] - Starting iteration 79. [2026-03-25 16:14:09,664][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:14:09,665][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:14:12,897][__main__][INFO] - Number of regex retries in iteration 79: 0 [2026-03-25 16:14:12,898][__main__][INFO] - agents played in iteration 79 are Alice, Bob [2026-03-25 16:14:13,493][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:14:14,157][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:14:14,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:14:14,771][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:14:15,093][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:14:15,415][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:14:15,736][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:14:16,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:14:16,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:14:16,701][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:14:17,023][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:14:17,345][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:14:17,667][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:14:17,988][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:14:18,310][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:14:18,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:14:18,954][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:14:19,275][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:14:19,595][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:14:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:14:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:14:20,561][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:14:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:14:21,204][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:14:21,526][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:14:21,847][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:14:22,168][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:14:22,491][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:14:22,812][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:14:23,133][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:14:23,454][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:14:23,775][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:14:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:14:24,419][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:14:24,741][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:14:25,063][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:14:25,385][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:14:25,708][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:14:26,029][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:14:26,351][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:14:26,673][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:14:26,996][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:14:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:14:27,640][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:14:27,961][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:14:28,283][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:14:28,605][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:14:28,927][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:14:29,249][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:14:29,571][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:14:29,892][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:14:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:14:30,535][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:14:31,152][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:14:31,474][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:14:31,795][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:14:32,116][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:14:32,436][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:14:32,757][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:14:33,078][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:14:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:14:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:14:34,043][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:14:34,364][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:14:34,686][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:14:35,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:14:35,665][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:14:36,416][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:14:36,418][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:14:36,419][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:14:37,065][__main__][INFO] - Iteration 80 took 27s (11.80% Gen, 85.84% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 57m 8s. Estimated total time: 7h 36m 41s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 20s. [2026-03-25 16:14:37,067][__main__][INFO] - Starting iteration 80. [2026-03-25 16:14:37,070][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:14:37,071][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:14:40,294][__main__][INFO] - Number of regex retries in iteration 80: 0 [2026-03-25 16:14:40,294][__main__][INFO] - agents played in iteration 80 are Alice, Bob [2026-03-25 16:14:40,888][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:14:41,551][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:14:41,842][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:14:42,164][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:14:42,486][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:14:42,807][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:14:43,128][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:14:43,449][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:14:43,772][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:14:44,093][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:14:44,415][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:14:44,737][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:14:45,058][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:14:45,380][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:14:45,702][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:14:46,023][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:14:46,343][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:14:46,664][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:14:46,985][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:14:47,306][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:14:47,627][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:14:47,948][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:14:48,269][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:14:48,590][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:14:48,911][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:14:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:14:49,555][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:14:49,875][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:14:50,197][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:14:50,519][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:14:50,840][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:14:51,162][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:14:51,484][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:14:51,805][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:14:52,127][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:14:52,448][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:14:52,769][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:14:53,091][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:14:53,413][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:14:53,734][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:14:54,056][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:14:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:14:54,698][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:14:55,019][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:14:55,341][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:14:55,663][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:14:55,983][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:14:56,304][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:14:56,625][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:14:56,947][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:14:57,270][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:14:57,592][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:14:57,913][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:14:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:14:58,853][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:14:59,174][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:14:59,497][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:14:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:15:00,139][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:15:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:15:00,783][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:15:01,105][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:15:01,427][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:15:01,750][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:15:02,072][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:15:02,395][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:15:03,052][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:15:03,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:15:03,803][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:15:03,804][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:15:04,452][__main__][INFO] - Iteration 81 took 27s (11.77% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 56m 23s. Estimated total time: 7h 36m 23s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 11s. [2026-03-25 16:15:04,455][__main__][INFO] - Starting iteration 81. [2026-03-25 16:15:04,458][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:15:04,459][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:15:07,690][__main__][INFO] - Number of regex retries in iteration 81: 0 [2026-03-25 16:15:07,691][__main__][INFO] - agents played in iteration 81 are Alice, Bob [2026-03-25 16:15:08,293][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:15:08,959][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:15:09,251][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:15:09,573][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:15:09,894][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:15:10,216][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:15:10,537][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:15:10,858][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:15:11,180][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:15:11,502][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:15:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:15:12,145][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:15:12,466][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:15:12,787][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:15:13,108][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:15:13,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:15:13,750][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:15:14,071][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:15:14,393][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:15:14,714][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:15:15,034][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:15:15,357][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:15:15,677][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:15:16,000][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:15:16,321][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:15:16,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:15:16,965][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:15:17,286][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:15:17,608][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:15:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:15:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:15:18,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:15:18,895][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:15:19,216][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:15:19,538][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:15:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:15:20,183][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:15:20,504][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:15:20,826][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:15:21,148][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:15:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:15:21,790][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:15:22,111][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:15:22,432][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:15:22,754][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:15:23,075][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:15:23,397][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:15:23,718][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:15:24,039][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:15:24,360][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:15:24,681][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:15:25,004][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:15:25,325][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:15:25,944][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:15:26,265][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:15:26,587][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:15:26,908][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:15:27,231][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:15:27,554][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:15:27,877][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:15:28,198][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:15:28,519][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:15:28,840][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:15:29,162][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:15:29,483][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:15:29,804][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:15:30,460][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:15:31,215][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:15:31,217][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:15:31,219][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:15:31,878][__main__][INFO] - Iteration 82 took 27s (11.79% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 56m 33s. Estimated total time: 7h 37m 0s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 42s, 500 more iterations: 3h 48m 30s. [2026-03-25 16:15:31,880][__main__][INFO] - Starting iteration 82. [2026-03-25 16:15:31,883][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:15:31,884][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:15:35,134][__main__][INFO] - Number of regex retries in iteration 82: 0 [2026-03-25 16:15:35,135][__main__][INFO] - agents played in iteration 82 are Alice, Bob [2026-03-25 16:15:35,729][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:15:36,393][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:15:36,685][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:15:37,007][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:15:37,329][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:15:37,651][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:15:37,972][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:15:38,293][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:15:38,615][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:15:38,937][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:15:39,257][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:15:39,579][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:15:39,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:15:40,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:15:40,543][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:15:40,866][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:15:41,188][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:15:41,509][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:15:41,831][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:15:42,153][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:15:42,474][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:15:42,796][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:15:43,117][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:15:43,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:15:43,760][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:15:44,081][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:15:44,402][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:15:44,724][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:15:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:15:45,369][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:15:45,690][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:15:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:15:46,333][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:15:46,654][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:15:46,977][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:15:47,299][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:15:47,621][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:15:47,943][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:15:48,265][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:15:48,587][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:15:48,908][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:15:49,231][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:15:49,552][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:15:49,874][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:15:50,194][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:15:50,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:15:50,837][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:15:51,158][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:15:51,480][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:15:51,801][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:15:52,122][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:15:52,444][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:15:52,766][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:15:53,385][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:15:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:15:54,028][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:15:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:15:54,671][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:15:54,993][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:15:55,315][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:15:55,637][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:15:55,959][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:15:56,280][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:15:56,601][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:15:56,924][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:15:57,246][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:15:57,906][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:15:58,672][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:15:58,674][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:15:58,676][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:15:59,322][__main__][INFO] - Iteration 83 took 27s (11.85% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 56m 25s. Estimated total time: 7h 37m 20s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 40s. [2026-03-25 16:15:59,324][__main__][INFO] - Starting iteration 83. [2026-03-25 16:15:59,327][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:15:59,328][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:16:02,612][__main__][INFO] - Number of regex retries in iteration 83: 0 [2026-03-25 16:16:02,613][__main__][INFO] - agents played in iteration 83 are Alice, Bob [2026-03-25 16:16:03,269][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:16:03,932][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:16:04,224][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:16:04,545][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:16:04,866][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:16:05,187][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:16:05,508][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:16:05,831][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:16:06,152][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:16:06,474][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:16:06,795][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:16:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:16:07,438][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:16:07,759][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:16:08,081][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:16:08,403][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:16:08,725][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:16:09,046][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:16:09,368][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:16:09,689][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:16:10,011][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:16:10,333][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:16:10,654][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:16:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:16:11,299][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:16:11,619][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:16:11,940][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:16:12,261][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:16:12,583][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:16:12,905][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:16:13,227][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:16:13,548][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:16:13,869][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:16:14,190][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:16:14,512][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:16:14,834][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:16:15,155][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:16:15,477][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:16:15,798][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:16:16,119][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:16:16,440][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:16:16,761][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:16:17,084][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:16:17,405][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:16:17,727][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:16:18,047][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:16:18,368][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:16:18,690][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:16:19,012][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:16:19,334][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:16:19,655][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:16:19,977][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:16:20,299][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:16:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:16:21,239][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:16:21,561][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:16:21,882][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:16:22,204][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:16:22,526][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:16:22,848][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:16:23,170][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:16:23,491][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:16:23,812][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:16:24,133][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:16:24,454][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:16:24,775][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:16:25,441][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:16:26,186][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:16:26,188][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:16:26,190][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:16:26,836][__main__][INFO] - Iteration 84 took 27s (11.94% Gen, 85.70% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 57m 6s. Estimated total time: 7h 38m 29s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 50s, 500 more iterations: 3h 49m 14s. [2026-03-25 16:16:26,838][__main__][INFO] - Starting iteration 84. [2026-03-25 16:16:26,841][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:16:26,841][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:16:30,098][__main__][INFO] - Number of regex retries in iteration 84: 0 [2026-03-25 16:16:30,099][__main__][INFO] - agents played in iteration 84 are Alice, Bob [2026-03-25 16:16:30,715][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:16:31,378][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:16:31,670][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:16:31,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:16:32,313][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:16:32,634][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:16:32,955][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:16:33,277][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:16:33,599][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:16:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:16:34,241][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:16:34,562][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:16:34,883][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:16:35,204][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:16:35,527][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:16:35,849][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:16:36,171][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:16:36,493][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:16:36,814][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:16:37,136][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:16:37,456][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:16:37,778][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:16:38,100][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:16:38,422][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:16:38,744][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:16:39,066][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:16:39,388][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:16:39,711][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:16:40,033][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:16:40,355][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:16:40,676][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:16:40,997][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:16:41,319][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:16:41,641][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:16:41,962][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:16:42,285][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:16:42,607][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:16:42,929][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:16:43,250][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:16:43,572][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:16:43,893][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:16:44,215][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:16:44,537][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:16:44,859][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:16:45,180][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:16:45,502][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:16:45,824][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:16:46,147][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:16:46,468][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:16:46,790][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:16:47,111][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:16:47,432][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:16:47,753][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:16:48,371][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:16:48,692][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:16:49,014][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:16:49,337][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:16:49,659][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:16:49,980][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:16:50,301][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:16:50,622][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:16:50,944][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:16:51,266][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:16:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:16:51,910][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:16:52,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:16:52,889][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:16:53,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:16:53,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:16:53,639][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:16:54,287][__main__][INFO] - Iteration 85 took 27s (11.87% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 55m 37s. Estimated total time: 7h 37m 27s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 43s. [2026-03-25 16:16:54,289][__main__][INFO] - Starting iteration 85. [2026-03-25 16:16:54,293][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:16:54,293][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:16:57,556][__main__][INFO] - Number of regex retries in iteration 85: 0 [2026-03-25 16:16:57,557][__main__][INFO] - agents played in iteration 85 are Alice, Bob [2026-03-25 16:16:58,162][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:16:58,827][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:16:59,119][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:16:59,440][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:16:59,762][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:17:00,083][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:17:00,405][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:17:00,727][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:17:01,048][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:17:01,369][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:17:01,690][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:17:02,012][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:17:02,333][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:17:02,656][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:17:02,977][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:17:03,298][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:17:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:17:03,942][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:17:04,263][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:17:04,585][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:17:04,907][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:17:05,229][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:17:05,551][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:17:05,874][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:17:06,196][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:17:06,518][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:17:06,840][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:17:07,163][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:17:07,485][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:17:07,807][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:17:08,128][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:17:08,451][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:17:08,773][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:17:09,094][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:17:09,416][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:17:09,738][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:17:10,060][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:17:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:17:10,704][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:17:11,026][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:17:11,347][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:17:11,670][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:17:11,992][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:17:12,313][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:17:12,635][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:17:12,956][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:17:13,278][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:17:13,600][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:17:13,922][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:17:14,244][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:17:14,565][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:17:14,887][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:17:15,208][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:17:15,826][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:17:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:17:16,471][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:17:16,793][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:17:17,114][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:17:17,435][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:17:17,757][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:17:18,079][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:17:18,400][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:17:18,721][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:17:19,042][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:17:19,363][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:17:19,686][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:17:20,343][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:17:21,084][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:17:21,086][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:17:21,088][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:17:21,736][__main__][INFO] - Iteration 86 took 27s (11.89% Gen, 85.74% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 55m 7s. Estimated total time: 7h 37m 24s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 42s. [2026-03-25 16:17:21,738][__main__][INFO] - Starting iteration 86. [2026-03-25 16:17:21,742][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:17:21,742][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:17:25,004][__main__][INFO] - Number of regex retries in iteration 86: 0 [2026-03-25 16:17:25,004][__main__][INFO] - agents played in iteration 86 are Alice, Bob [2026-03-25 16:17:25,572][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:17:26,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:17:26,527][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:17:26,849][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:17:27,171][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:17:27,492][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:17:27,814][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:17:28,135][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:17:28,458][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:17:28,781][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:17:29,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:17:29,426][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:17:29,748][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:17:30,068][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:17:30,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:17:30,712][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:17:31,034][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:17:31,355][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:17:31,677][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:17:31,998][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:17:32,321][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:17:32,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:17:32,964][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:17:33,285][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:17:33,607][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:17:33,928][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:17:34,250][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:17:34,572][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:17:34,894][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:17:35,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:17:35,538][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:17:35,860][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:17:36,182][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:17:36,503][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:17:36,825][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:17:37,147][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:17:37,468][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:17:37,789][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:17:38,110][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:17:38,432][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:17:38,753][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:17:39,075][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:17:39,397][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:17:39,719][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:17:40,041][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:17:40,363][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:17:40,684][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:17:41,006][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:17:41,328][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:17:41,648][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:17:41,969][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:17:42,290][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:17:42,612][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:17:43,230][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:17:43,552][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:17:43,874][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:17:44,197][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:17:44,518][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:17:44,840][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:17:45,162][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:17:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:17:45,806][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:17:46,129][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:17:46,451][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:17:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:17:47,094][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:17:47,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:17:48,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:17:48,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:17:48,499][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:17:49,145][__main__][INFO] - Iteration 87 took 27s (11.90% Gen, 85.73% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 53m 59s. Estimated total time: 7h 36m 44s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 22s. [2026-03-25 16:17:49,147][__main__][INFO] - Starting iteration 87. [2026-03-25 16:17:49,150][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:17:49,151][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:17:52,400][__main__][INFO] - Number of regex retries in iteration 87: 0 [2026-03-25 16:17:52,400][__main__][INFO] - agents played in iteration 87 are Alice, Bob [2026-03-25 16:17:52,998][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:17:53,661][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:17:53,952][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:17:54,275][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:17:54,595][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:17:54,916][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:17:55,238][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:17:55,559][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:17:55,881][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:17:56,203][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:17:56,525][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:17:56,846][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:17:57,169][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:17:57,491][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:17:57,813][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:17:58,135][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:17:58,456][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:17:58,778][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:17:59,101][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:17:59,423][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:17:59,744][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:18:00,066][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:18:00,388][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:18:00,711][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:18:01,032][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:18:01,354][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:18:01,675][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:18:01,996][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:18:02,319][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:18:02,640][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:18:02,961][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:18:03,283][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:18:03,605][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:18:03,927][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:18:04,249][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:18:04,571][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:18:04,892][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:18:05,213][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:18:05,536][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:18:05,858][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:18:06,180][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:18:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:18:06,823][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:18:07,145][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:18:07,466][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:18:07,789][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:18:08,110][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:18:08,432][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:18:08,753][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:18:09,074][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:18:09,395][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:18:09,716][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:18:10,038][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:18:10,655][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:18:10,977][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:18:11,299][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:18:11,621][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:18:11,943][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:18:12,265][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:18:12,587][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:18:12,909][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:18:13,232][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:18:13,555][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:18:13,878][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:18:14,200][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:18:14,522][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:18:15,182][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:18:15,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:18:15,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:18:15,938][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:18:16,706][__main__][INFO] - Iteration 88 took 27s (11.79% Gen, 85.41% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 56m 5s. Estimated total time: 7h 39m 17s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 55s, 500 more iterations: 3h 49m 38s. [2026-03-25 16:18:16,708][__main__][INFO] - Starting iteration 88. [2026-03-25 16:18:16,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:18:16,712][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:18:19,955][__main__][INFO] - Number of regex retries in iteration 88: 0 [2026-03-25 16:18:19,956][__main__][INFO] - agents played in iteration 88 are Alice, Bob [2026-03-25 16:18:20,544][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:18:21,210][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:18:21,501][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:18:21,822][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:18:22,145][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:18:22,466][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:18:22,787][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:18:23,109][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:18:23,431][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:18:23,753][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:18:24,075][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:18:24,397][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:18:24,719][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:18:25,040][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:18:25,362][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:18:25,684][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:18:26,006][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:18:26,328][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:18:26,649][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:18:26,970][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:18:27,293][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:18:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:18:27,936][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:18:28,257][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:18:28,579][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:18:28,900][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:18:29,222][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:18:29,543][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:18:29,864][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:18:30,186][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:18:30,508][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:18:30,829][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:18:31,151][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:18:31,472][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:18:31,794][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:18:32,115][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:18:32,438][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:18:32,759][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:18:33,081][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:18:33,402][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:18:33,724][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:18:34,045][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:18:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:18:34,689][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:18:35,009][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:18:35,331][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:18:35,655][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:18:35,978][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:18:36,301][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:18:36,624][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:18:36,946][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:18:37,268][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:18:37,590][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:18:38,207][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:18:38,530][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:18:38,851][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:18:39,173][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:18:39,494][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:18:39,816][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:18:40,138][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:18:40,460][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:18:40,781][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:18:41,103][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:18:41,426][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:18:41,749][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:18:42,071][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:18:42,727][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:18:43,467][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:18:43,470][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:18:43,471][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:18:44,118][__main__][INFO] - Iteration 89 took 27s (11.84% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 53m 7s. Estimated total time: 7h 36m 47s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 23s. [2026-03-25 16:18:44,120][__main__][INFO] - Starting iteration 89. [2026-03-25 16:18:44,123][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:18:44,123][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:18:47,359][__main__][INFO] - Number of regex retries in iteration 89: 0 [2026-03-25 16:18:47,360][__main__][INFO] - agents played in iteration 89 are Alice, Bob [2026-03-25 16:18:47,942][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:18:48,605][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:18:48,897][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:18:49,220][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:18:49,541][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:18:49,862][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:18:50,184][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:18:50,506][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:18:50,827][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:18:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:18:51,471][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:18:51,792][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:18:52,114][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:18:52,435][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:18:52,757][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:18:53,079][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:18:53,401][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:18:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:18:54,045][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:18:54,367][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:18:54,689][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:18:55,011][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:18:55,332][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:18:55,653][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:18:55,975][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:18:56,296][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:18:56,618][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:18:56,940][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:18:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:18:57,585][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:18:57,907][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:18:58,229][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:18:58,549][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:18:58,870][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:18:59,191][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:18:59,512][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:18:59,833][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:19:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:19:00,475][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:19:00,796][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:19:01,118][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:19:01,439][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:19:01,760][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:19:02,082][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:19:02,403][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:19:02,726][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:19:03,047][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:19:03,369][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:19:03,690][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:19:04,011][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:19:04,332][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:19:04,654][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:19:04,975][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:19:05,595][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:19:05,916][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:19:06,237][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:19:06,558][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:19:06,880][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:19:07,202][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:19:07,524][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:19:07,846][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:19:08,168][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:19:08,489][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:19:08,810][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:19:09,130][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:19:09,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:19:10,114][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:19:10,858][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:19:10,860][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:19:10,862][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:19:11,506][__main__][INFO] - Iteration 90 took 27s (11.82% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 52m 17s. Estimated total time: 7h 36m 24s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 12s. [2026-03-25 16:19:11,509][__main__][INFO] - Starting iteration 90. [2026-03-25 16:19:11,512][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:19:11,512][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:19:14,717][__main__][INFO] - Number of regex retries in iteration 90: 0 [2026-03-25 16:19:14,717][__main__][INFO] - agents played in iteration 90 are Alice, Bob [2026-03-25 16:19:15,318][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:19:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:19:16,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:19:16,596][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:19:16,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:19:17,241][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:19:17,563][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:19:17,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:19:18,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:19:18,526][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:19:18,848][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:19:19,170][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:19:19,491][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:19:19,812][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:19:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:19:20,454][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:19:20,776][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:19:21,098][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:19:21,420][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:19:21,742][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:19:22,064][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:19:22,386][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:19:22,709][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:19:23,031][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:19:23,352][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:19:23,674][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:19:23,996][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:19:24,317][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:19:24,638][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:19:24,959][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:19:25,280][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:19:25,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:19:25,923][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:19:26,245][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:19:26,567][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:19:26,889][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:19:27,210][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:19:27,532][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:19:27,854][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:19:28,176][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:19:28,498][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:19:28,819][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:19:29,141][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:19:29,462][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:19:29,783][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:19:30,105][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:19:30,428][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:19:30,750][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:19:31,072][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:19:31,393][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:19:31,715][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:19:32,036][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:19:32,357][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:19:32,976][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:19:33,297][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:19:33,619][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:19:33,941][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:19:34,262][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:19:34,584][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:19:34,905][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:19:35,227][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:19:35,550][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:19:35,872][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:19:36,194][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:19:36,515][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:19:36,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:19:37,495][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:19:38,245][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:19:38,248][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:19:38,249][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:19:38,928][__main__][INFO] - Iteration 91 took 27s (11.69% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 52m 23s. Estimated total time: 7h 36m 57s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 28s. [2026-03-25 16:19:38,931][__main__][INFO] - Starting iteration 91. [2026-03-25 16:19:38,934][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:19:38,934][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:19:42,130][__main__][INFO] - Number of regex retries in iteration 91: 0 [2026-03-25 16:19:42,130][__main__][INFO] - agents played in iteration 91 are Alice, Bob [2026-03-25 16:19:42,718][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:19:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:19:43,678][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:19:43,999][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:19:44,321][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:19:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:19:44,963][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:19:45,284][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:19:45,606][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:19:45,927][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:19:46,249][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:19:46,570][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:19:46,893][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:19:47,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:19:47,538][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:19:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:19:48,179][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:19:48,500][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:19:48,820][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:19:49,141][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:19:49,462][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:19:49,783][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:19:50,104][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:19:50,425][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:19:50,747][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:19:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:19:51,391][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:19:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:19:52,034][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:19:52,357][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:19:52,678][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:19:52,999][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:19:53,320][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:19:53,641][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:19:53,963][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:19:54,285][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:19:54,606][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:19:54,928][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:19:55,249][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:19:55,571][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:19:55,894][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:19:56,215][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:19:56,536][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:19:56,858][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:19:57,179][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:19:57,502][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:19:57,823][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:19:58,143][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:19:58,465][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:19:58,786][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:19:59,108][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:19:59,430][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:19:59,752][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:20:00,373][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:20:00,694][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:20:01,015][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:20:01,337][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:20:01,659][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:20:01,981][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:20:02,302][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:20:02,625][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:20:02,947][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:20:03,269][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:20:03,590][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:20:03,911][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:20:04,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:20:04,893][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:20:05,633][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:20:05,635][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:20:05,637][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:20:06,294][__main__][INFO] - Iteration 92 took 27s (11.68% Gen, 85.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 50m 59s. Estimated total time: 7h 36m 1s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 36s, 500 more iterations: 3h 48m 0s. [2026-03-25 16:20:06,296][__main__][INFO] - Starting iteration 92. [2026-03-25 16:20:06,299][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:20:06,300][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:20:09,499][__main__][INFO] - Number of regex retries in iteration 92: 0 [2026-03-25 16:20:09,500][__main__][INFO] - agents played in iteration 92 are Alice, Bob [2026-03-25 16:20:10,089][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:20:10,758][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:20:11,050][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:20:11,371][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:20:11,693][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:20:12,016][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:20:12,337][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:20:12,659][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:20:12,980][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:20:13,302][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:20:13,623][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:20:13,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:20:14,267][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:20:14,590][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:20:14,912][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:20:15,235][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:20:15,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:20:15,878][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:20:16,199][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:20:16,519][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:20:16,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:20:17,163][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:20:17,485][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:20:17,807][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:20:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:20:18,451][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:20:18,772][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:20:19,093][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:20:19,414][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:20:19,735][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:20:20,056][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:20:20,377][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:20:20,699][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:20:21,020][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:20:21,342][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:20:21,663][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:20:21,984][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:20:22,306][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:20:22,627][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:20:22,948][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:20:23,270][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:20:23,592][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:20:23,914][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:20:24,237][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:20:24,558][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:20:24,880][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:20:25,202][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:20:25,524][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:20:25,846][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:20:26,168][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:20:26,490][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:20:26,811][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:20:27,133][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:20:27,752][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:20:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:20:28,395][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:20:28,716][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:20:29,038][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:20:29,359][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:20:29,681][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:20:30,002][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:20:30,324][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:20:30,646][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:20:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:20:31,289][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:20:31,610][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:20:32,272][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:20:33,015][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:20:33,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:20:33,020][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:20:33,673][__main__][INFO] - Iteration 93 took 27s (11.69% Gen, 85.92% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 50m 45s. Estimated total time: 7h 36m 14s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 7s. [2026-03-25 16:20:33,675][__main__][INFO] - Starting iteration 93. [2026-03-25 16:20:33,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:20:33,679][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:20:36,889][__main__][INFO] - Number of regex retries in iteration 93: 0 [2026-03-25 16:20:36,890][__main__][INFO] - agents played in iteration 93 are Alice, Bob [2026-03-25 16:20:37,483][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:20:38,151][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:20:38,443][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:20:38,765][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:20:39,086][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:20:39,408][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:20:39,728][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:20:40,050][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:20:40,371][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:20:40,692][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:20:41,013][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:20:41,335][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:20:41,656][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:20:41,977][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:20:42,300][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:20:42,622][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:20:42,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:20:43,265][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:20:43,588][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:20:43,910][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:20:44,231][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:20:44,553][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:20:44,874][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:20:45,197][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:20:45,519][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:20:45,841][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:20:46,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:20:46,484][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:20:46,806][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:20:47,128][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:20:47,450][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:20:47,771][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:20:48,092][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:20:48,414][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:20:48,735][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:20:49,056][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:20:49,377][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:20:49,699][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:20:50,021][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:20:50,343][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:20:50,664][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:20:50,986][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:20:51,308][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:20:51,630][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:20:51,951][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:20:52,273][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:20:52,595][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:20:52,917][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:20:53,238][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:20:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:20:53,881][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:20:54,203][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:20:54,524][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:20:55,145][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:20:55,466][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:20:55,788][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:20:56,110][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:20:56,432][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:20:56,753][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:20:57,074][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:20:57,396][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:20:57,719][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:20:58,040][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:20:58,361][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:20:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:20:59,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:20:59,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:21:00,399][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:21:00,402][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:21:00,403][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:21:01,050][__main__][INFO] - Iteration 94 took 27s (11.73% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 50m 15s. Estimated total time: 7h 36m 12s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 6s. [2026-03-25 16:21:01,052][__main__][INFO] - Starting iteration 94. [2026-03-25 16:21:01,055][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:21:01,056][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:21:04,286][__main__][INFO] - Number of regex retries in iteration 94: 0 [2026-03-25 16:21:04,287][__main__][INFO] - agents played in iteration 94 are Alice, Bob [2026-03-25 16:21:04,883][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:21:05,545][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:21:05,838][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:21:06,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:21:06,483][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:21:06,804][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:21:07,125][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:21:07,447][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:21:07,769][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:21:08,091][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:21:08,412][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:21:08,734][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:21:09,056][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:21:09,378][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:21:09,700][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:21:10,022][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:21:10,345][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:21:10,667][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:21:10,989][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:21:11,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:21:11,633][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:21:11,954][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:21:12,275][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:21:12,597][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:21:12,918][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:21:13,239][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:21:13,561][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:21:13,883][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:21:14,206][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:21:14,528][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:21:14,851][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:21:15,173][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:21:15,496][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:21:15,817][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:21:16,140][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:21:16,461][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:21:16,783][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:21:17,104][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:21:17,426][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:21:17,749][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:21:18,072][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:21:18,394][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:21:18,716][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:21:19,036][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:21:19,357][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:21:19,679][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:21:20,001][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:21:20,322][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:21:20,643][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:21:20,965][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:21:21,287][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:21:21,608][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:21:21,930][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:21:22,547][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:21:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:21:23,191][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:21:23,513][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:21:23,835][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:21:24,156][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:21:24,477][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:21:24,800][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:21:25,121][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:21:25,443][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:21:25,765][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:21:26,086][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:21:26,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:21:27,065][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:21:27,826][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:21:27,828][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:21:27,830][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:21:28,494][__main__][INFO] - Iteration 95 took 27s (11.78% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 50m 55s. Estimated total time: 7h 37m 19s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 43s, 500 more iterations: 3h 48m 39s. [2026-03-25 16:21:28,496][__main__][INFO] - Starting iteration 95. [2026-03-25 16:21:28,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:21:28,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:21:31,709][__main__][INFO] - Number of regex retries in iteration 95: 0 [2026-03-25 16:21:31,710][__main__][INFO] - agents played in iteration 95 are Alice, Bob [2026-03-25 16:21:32,303][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:21:32,967][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:21:33,260][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:21:33,583][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:21:33,904][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:21:34,227][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:21:34,549][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:21:34,872][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:21:35,195][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:21:35,517][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:21:35,838][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:21:36,160][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:21:36,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:21:36,803][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:21:37,124][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:21:37,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:21:37,767][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:21:38,088][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:21:38,409][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:21:38,730][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:21:39,051][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:21:39,373][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:21:39,696][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:21:40,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:21:40,339][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:21:40,660][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:21:40,983][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:21:41,304][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:21:41,626][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:21:41,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:21:42,270][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:21:42,592][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:21:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:21:43,236][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:21:43,559][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:21:43,882][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:21:44,204][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:21:44,525][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:21:44,846][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:21:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:21:45,489][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:21:45,810][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:21:46,132][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:21:46,454][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:21:46,776][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:21:47,097][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:21:47,419][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:21:47,740][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:21:48,062][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:21:48,385][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:21:48,707][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:21:49,029][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:21:49,351][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:21:49,970][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:21:50,292][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:21:50,613][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:21:50,935][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:21:51,257][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:21:51,579][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:21:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:21:52,223][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:21:52,545][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:21:52,867][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:21:53,190][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:21:53,512][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:21:53,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:21:54,491][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:21:55,235][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:21:55,237][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:21:55,239][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:21:55,930][__main__][INFO] - Iteration 96 took 27s (11.70% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 50m 20s. Estimated total time: 7h 37m 11s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 43s, 500 more iterations: 3h 48m 35s. [2026-03-25 16:21:55,932][__main__][INFO] - Starting iteration 96. [2026-03-25 16:21:55,935][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:21:55,936][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:21:59,174][__main__][INFO] - Number of regex retries in iteration 96: 0 [2026-03-25 16:21:59,175][__main__][INFO] - agents played in iteration 96 are Alice, Bob [2026-03-25 16:21:59,778][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:22:00,446][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:22:00,737][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:22:01,059][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:22:01,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:22:01,703][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:22:02,025][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:22:02,346][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:22:02,668][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:22:02,990][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:22:03,311][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:22:03,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:22:03,953][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:22:04,274][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:22:04,596][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:22:04,917][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:22:05,239][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:22:05,562][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:22:05,883][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:22:06,205][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:22:06,527][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:22:06,849][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:22:07,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:22:07,493][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:22:07,814][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:22:08,136][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:22:08,456][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:22:08,778][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:22:09,100][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:22:09,421][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:22:09,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:22:10,065][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:22:10,387][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:22:10,710][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:22:11,032][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:22:11,355][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:22:11,676][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:22:11,998][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:22:12,319][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:22:12,640][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:22:12,961][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:22:13,282][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:22:13,603][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:22:13,925][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:22:14,247][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:22:14,570][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:22:14,891][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:22:15,213][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:22:15,535][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:22:15,856][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:22:16,177][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:22:16,499][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:22:16,821][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:22:17,443][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:22:17,765][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:22:18,086][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:22:18,408][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:22:18,730][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:22:19,051][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:22:19,372][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:22:19,694][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:22:20,016][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:22:20,338][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:22:20,661][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:22:20,984][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:22:21,306][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:22:21,970][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:22:22,709][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:22:22,711][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:22:22,713][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:22:23,627][__main__][INFO] - Iteration 97 took 27s (11.70% Gen, 84.99% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 54m 13s. Estimated total time: 7h 41m 33s. Time estimates for 10 more iterations: 4m 36s, 100 more iterations: 46m 9s, 500 more iterations: 3h 50m 46s. [2026-03-25 16:22:23,630][__main__][INFO] - Starting iteration 97. [2026-03-25 16:22:23,633][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:22:23,633][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:22:26,843][__main__][INFO] - Number of regex retries in iteration 97: 0 [2026-03-25 16:22:26,843][__main__][INFO] - agents played in iteration 97 are Alice, Bob [2026-03-25 16:22:27,428][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:22:28,098][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:22:28,388][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:22:28,710][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:22:29,032][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:22:29,352][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:22:29,673][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:22:29,994][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:22:30,315][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:22:30,637][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:22:30,958][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:22:31,280][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:22:31,602][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:22:31,922][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:22:32,244][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:22:32,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:22:32,886][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:22:33,209][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:22:33,531][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:22:33,853][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:22:34,175][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:22:34,496][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:22:34,817][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:22:35,138][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:22:35,460][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:22:35,781][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:22:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:22:36,424][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:22:36,747][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:22:37,069][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:22:37,391][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:22:37,713][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:22:38,035][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:22:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:22:38,678][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:22:39,000][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:22:39,322][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:22:39,642][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:22:39,963][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:22:40,285][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:22:40,607][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:22:40,928][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:22:41,250][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:22:41,572][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:22:41,894][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:22:42,216][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:22:42,538][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:22:42,859][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:22:43,181][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:22:43,502][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:22:43,824][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:22:44,146][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:22:44,468][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:22:45,090][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:22:45,412][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:22:45,733][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:22:46,054][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:22:46,376][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:22:46,698][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:22:47,019][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:22:47,340][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:22:47,661][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:22:47,983][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:22:48,305][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:22:48,627][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:22:48,948][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:22:49,611][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:22:50,351][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:22:50,353][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:22:50,355][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:22:51,252][__main__][INFO] - Iteration 98 took 27s (11.62% Gen, 85.12% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 52m 33s. Estimated total time: 7h 40m 20s. Time estimates for 10 more iterations: 4m 36s, 100 more iterations: 46m 2s, 500 more iterations: 3h 50m 10s. [2026-03-25 16:22:51,255][__main__][INFO] - Starting iteration 98. [2026-03-25 16:22:51,258][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:22:51,259][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:22:54,505][__main__][INFO] - Number of regex retries in iteration 98: 0 [2026-03-25 16:22:54,506][__main__][INFO] - agents played in iteration 98 are Alice, Bob [2026-03-25 16:22:55,104][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:22:55,775][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:22:56,067][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:22:56,390][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:22:56,712][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:22:57,033][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:22:57,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:22:57,676][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:22:57,998][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:22:58,320][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:22:58,641][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:22:58,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:22:59,283][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:22:59,604][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:22:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:23:00,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:23:00,570][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:23:00,892][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:23:01,213][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:23:01,536][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:23:01,858][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:23:02,180][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:23:02,501][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:23:02,822][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:23:03,144][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:23:03,465][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:23:03,787][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:23:04,108][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:23:04,430][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:23:04,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:23:05,073][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:23:05,395][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:23:05,718][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:23:06,041][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:23:06,362][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:23:06,684][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:23:07,006][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:23:07,328][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:23:07,651][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:23:07,972][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:23:08,294][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:23:08,615][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:23:08,936][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:23:09,257][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:23:09,579][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:23:09,900][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:23:10,221][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:23:10,542][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:23:10,864][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:23:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:23:11,508][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:23:11,830][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:23:12,151][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:23:12,773][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:23:13,094][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:23:13,415][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:23:13,737][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:23:14,058][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:23:14,380][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:23:14,702][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:23:15,023][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:23:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:23:15,666][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:23:15,988][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:23:16,310][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:23:16,632][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:23:17,295][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:23:18,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:23:18,038][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:23:18,040][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:23:18,691][__main__][INFO] - Iteration 99 took 27s (11.84% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 48m 59s. Estimated total time: 7h 37m 13s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 43s, 500 more iterations: 3h 48m 36s. [2026-03-25 16:23:18,693][__main__][INFO] - Starting iteration 99. [2026-03-25 16:23:18,696][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:23:18,697][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:23:21,905][__main__][INFO] - Number of regex retries in iteration 99: 0 [2026-03-25 16:23:21,905][__main__][INFO] - agents played in iteration 99 are Alice, Bob [2026-03-25 16:23:22,500][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:23:23,167][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:23:23,459][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:23:23,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:23:24,104][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:23:24,425][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:23:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:23:25,069][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:23:25,390][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:23:25,712][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:23:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:23:26,356][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:23:26,678][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:23:26,998][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:23:27,320][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:23:27,642][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:23:27,963][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:23:28,284][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:23:28,606][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:23:28,928][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:23:29,250][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:23:29,571][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:23:29,893][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:23:30,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:23:30,537][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:23:30,858][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:23:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:23:31,502][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:23:31,824][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:23:32,147][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:23:32,470][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:23:32,792][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:23:33,115][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:23:33,437][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:23:33,760][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:23:34,082][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:23:34,404][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:23:34,726][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:23:35,048][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:23:35,371][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:23:35,693][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:23:36,015][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:23:36,338][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:23:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:23:36,983][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:23:37,305][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:23:37,627][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:23:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:23:38,273][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:23:38,595][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:23:38,917][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:23:39,239][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:23:39,561][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:23:40,184][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:23:40,507][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:23:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:23:41,152][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:23:41,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:23:41,796][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:23:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:23:42,442][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:23:42,764][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:23:43,089][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:23:43,411][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:23:43,734][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:23:44,056][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:23:44,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:23:45,468][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:23:45,470][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:23:45,472][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:23:46,085][__main__][INFO] - Iteration 100 took 27s (11.72% Gen, 86.04% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 47m 48s. Estimated total time: 7h 36m 29s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 14s. [2026-03-25 16:23:46,087][__main__][INFO] - Starting iteration 100. [2026-03-25 16:23:46,090][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 1 and human policies 1. [2026-03-25 16:23:46,091][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:23:49,308][__main__][INFO] - Number of regex retries in iteration 100: 0 [2026-03-25 16:23:49,309][__main__][INFO] - agents played in iteration 100 are Alice, Bob [2026-03-25 16:23:49,894][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:23:50,563][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:23:50,854][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:23:51,177][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:23:51,498][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:23:51,820][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:23:52,141][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:23:52,463][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:23:52,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:23:53,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:23:53,429][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:23:53,751][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:23:54,072][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:23:54,393][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:23:54,714][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:23:55,035][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:23:55,357][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:23:55,679][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:23:56,000][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:23:56,321][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:23:56,643][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:23:56,965][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:23:57,287][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:23:57,609][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:23:57,930][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:23:58,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:23:58,572][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:23:58,894][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:23:59,216][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:23:59,537][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:23:59,858][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:24:00,179][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:24:00,499][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:24:00,821][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:24:01,143][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:24:01,464][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:24:01,786][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:24:02,108][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:24:02,429][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:24:02,750][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:24:03,072][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:24:03,394][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:24:03,715][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:24:04,036][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:24:04,358][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:24:04,679][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:24:05,000][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:24:05,322][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:24:05,644][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:24:05,964][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:24:06,286][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:24:06,608][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:24:06,929][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:24:07,549][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:24:07,872][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:24:08,193][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:24:08,514][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:24:08,836][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:24:09,157][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:24:09,478][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:24:09,799][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:24:10,121][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:24:10,443][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:24:10,765][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:24:11,086][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:24:11,409][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:24:12,071][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:24:12,813][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:24:12,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:24:12,817][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:24:14,132][__main__][INFO] - Iteration 101 took 28s (11.48% Gen, 83.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 58m 13s. Estimated total time: 7h 47m 23s. Time estimates for 10 more iterations: 4m 40s, 100 more iterations: 46m 44s, 500 more iterations: 3h 53m 41s. [2026-03-25 16:24:14,135][__main__][INFO] - Starting iteration 101. [2026-03-25 16:24:14,139][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:24:14,139][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:24:17,354][__main__][INFO] - Number of regex retries in iteration 101: 0 [2026-03-25 16:24:17,354][__main__][INFO] - agents played in iteration 101 are Alice, Bob [2026-03-25 16:24:17,948][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:24:18,615][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:24:18,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:24:19,228][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:24:19,550][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:24:19,871][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:24:20,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:24:20,514][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:24:20,836][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:24:21,157][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:24:21,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:24:21,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:24:22,121][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:24:22,441][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:24:22,762][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:24:23,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:24:23,404][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:24:23,725][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:24:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:24:24,368][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:24:24,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:24:25,012][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:24:25,334][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:24:25,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:24:25,977][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:24:26,297][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:24:26,620][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:24:26,941][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:24:27,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:24:27,584][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:24:27,906][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:24:28,228][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:24:28,550][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:24:28,872][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:24:29,193][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:24:29,515][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:24:29,836][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:24:30,157][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:24:30,479][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:24:30,802][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:24:31,123][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:24:31,444][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:24:31,766][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:24:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:24:32,409][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:24:32,730][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:24:33,052][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:24:33,373][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:24:33,695][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:24:34,017][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:24:34,338][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:24:34,659][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:24:34,979][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:24:35,604][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:24:35,924][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:24:36,245][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:24:36,566][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:24:36,888][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:24:37,210][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:24:37,532][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:24:37,853][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:24:38,175][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:24:38,496][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:24:38,818][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:24:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:24:39,462][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:24:40,123][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:24:40,857][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:24:40,859][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:24:40,860][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:24:41,512][__main__][INFO] - Iteration 102 took 27s (11.75% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 46m 37s. Estimated total time: 7h 36m 14s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 7s. [2026-03-25 16:24:41,514][__main__][INFO] - Starting iteration 102. [2026-03-25 16:24:41,518][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:24:41,518][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:24:44,722][__main__][INFO] - Number of regex retries in iteration 102: 0 [2026-03-25 16:24:44,723][__main__][INFO] - agents played in iteration 102 are Alice, Bob [2026-03-25 16:24:45,320][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:24:45,984][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:24:46,276][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:24:46,599][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:24:46,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:24:47,241][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:24:47,563][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:24:47,883][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:24:48,205][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:24:48,527][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:24:48,849][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:24:49,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:24:49,492][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:24:49,814][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:24:50,137][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:24:50,459][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:24:50,780][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:24:51,102][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:24:51,424][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:24:51,746][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:24:52,068][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:24:52,391][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:24:52,713][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:24:53,035][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:24:53,357][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:24:53,679][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:24:54,001][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:24:54,323][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:24:54,645][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:24:54,967][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:24:55,289][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:24:55,611][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:24:55,934][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:24:56,256][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:24:56,578][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:24:56,899][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:24:57,221][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:24:57,543][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:24:57,866][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:24:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:24:58,511][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:24:58,833][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:24:59,156][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:24:59,477][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:24:59,799][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:25:00,122][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:25:00,444][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:25:00,767][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:25:01,089][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:25:01,412][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:25:01,733][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:25:02,055][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:25:02,377][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:25:03,010][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:25:03,333][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:25:03,654][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:25:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:25:04,297][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:25:04,619][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:25:04,941][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:25:05,264][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:25:05,586][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:25:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:25:06,229][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:25:06,550][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:25:06,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:25:07,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:25:08,271][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:25:08,274][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:25:08,275][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:25:08,916][__main__][INFO] - Iteration 103 took 27s (11.70% Gen, 85.96% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 46m 34s. Estimated total time: 7h 36m 39s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 19s. [2026-03-25 16:25:08,918][__main__][INFO] - Starting iteration 103. [2026-03-25 16:25:08,921][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:25:08,922][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:25:12,134][__main__][INFO] - Number of regex retries in iteration 103: 0 [2026-03-25 16:25:12,135][__main__][INFO] - agents played in iteration 103 are Alice, Bob [2026-03-25 16:25:12,720][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:25:13,386][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:25:13,677][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:25:13,998][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:25:14,320][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:25:14,641][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:25:14,962][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:25:15,283][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:25:15,603][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:25:15,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:25:16,248][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:25:16,571][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:25:16,894][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:25:17,215][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:25:17,539][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:25:17,860][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:25:18,181][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:25:18,503][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:25:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:25:19,147][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:25:19,468][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:25:19,790][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:25:20,111][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:25:20,432][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:25:20,753][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:25:21,074][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:25:21,395][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:25:21,716][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:25:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:25:22,361][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:25:22,682][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:25:23,004][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:25:23,324][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:25:23,645][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:25:23,967][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:25:24,288][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:25:24,610][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:25:24,931][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:25:25,252][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:25:25,573][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:25:25,895][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:25:26,217][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:25:26,541][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:25:26,865][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:25:27,188][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:25:27,511][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:25:27,834][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:25:28,157][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:25:28,479][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:25:28,801][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:25:29,123][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:25:29,443][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:25:29,764][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:25:30,385][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:25:30,708][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:25:31,030][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:25:31,352][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:25:31,673][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:25:31,994][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:25:32,317][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:25:32,639][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:25:32,961][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:25:33,282][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:25:33,604][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:25:33,925][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:25:34,248][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:25:34,913][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:25:35,670][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:25:35,672][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:25:35,674][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:25:36,326][__main__][INFO] - Iteration 104 took 27s (11.72% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 46m 13s. Estimated total time: 7h 36m 45s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 22s. [2026-03-25 16:25:36,328][__main__][INFO] - Starting iteration 104. [2026-03-25 16:25:36,331][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:25:36,332][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:25:39,590][__main__][INFO] - Number of regex retries in iteration 104: 0 [2026-03-25 16:25:39,591][__main__][INFO] - agents played in iteration 104 are Alice, Bob [2026-03-25 16:25:40,239][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:25:40,909][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:25:41,202][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:25:41,524][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:25:41,847][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:25:42,170][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:25:42,492][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:25:42,813][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:25:43,135][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:25:43,456][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:25:43,776][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:25:44,097][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:25:44,418][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:25:44,740][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:25:45,060][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:25:45,381][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:25:45,703][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:25:46,024][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:25:46,346][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:25:46,667][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:25:46,989][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:25:47,311][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:25:47,633][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:25:47,955][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:25:48,276][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:25:48,598][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:25:48,919][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:25:49,240][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:25:49,562][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:25:49,884][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:25:50,205][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:25:50,526][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:25:50,848][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:25:51,170][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:25:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:25:51,814][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:25:52,135][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:25:52,456][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:25:52,778][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:25:53,100][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:25:53,421][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:25:53,742][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:25:54,063][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:25:54,384][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:25:54,706][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:25:55,027][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:25:55,348][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:25:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:25:55,992][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:25:56,313][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:25:56,635][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:25:56,957][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:25:57,278][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:25:57,899][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:25:58,221][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:25:58,542][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:25:58,863][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:25:59,184][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:25:59,505][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:25:59,827][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:26:00,149][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:26:00,471][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:26:00,792][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:26:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:26:01,436][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:26:01,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:26:02,421][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:26:03,178][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:26:03,180][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:26:03,181][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:26:03,828][__main__][INFO] - Iteration 105 took 27s (11.85% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 47m 18s. Estimated total time: 7h 38m 17s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 49s, 500 more iterations: 3h 49m 8s. [2026-03-25 16:26:03,830][__main__][INFO] - Starting iteration 105. [2026-03-25 16:26:03,833][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:26:03,834][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:26:07,036][__main__][INFO] - Number of regex retries in iteration 105: 0 [2026-03-25 16:26:07,037][__main__][INFO] - agents played in iteration 105 are Alice, Bob [2026-03-25 16:26:07,616][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:26:08,287][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:26:08,579][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:26:08,901][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:26:09,224][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:26:09,546][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:26:09,868][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:26:10,190][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:26:10,513][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:26:10,835][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:26:11,156][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:26:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:26:11,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:26:12,122][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:26:12,445][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:26:12,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:26:13,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:26:13,411][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:26:13,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:26:14,056][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:26:14,379][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:26:14,701][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:26:15,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:26:15,344][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:26:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:26:15,988][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:26:16,310][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:26:16,632][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:26:16,954][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:26:17,278][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:26:17,601][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:26:17,922][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:26:18,244][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:26:18,566][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:26:18,889][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:26:19,211][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:26:19,533][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:26:19,855][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:26:20,177][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:26:20,500][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:26:20,822][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:26:21,145][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:26:21,466][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:26:21,789][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:26:22,111][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:26:22,435][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:26:22,759][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:26:23,081][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:26:23,403][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:26:23,724][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:26:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:26:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:26:24,692][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:26:25,314][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:26:25,636][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:26:25,958][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:26:26,280][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:26:26,602][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:26:26,923][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:26:27,245][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:26:27,566][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:26:27,888][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:26:28,210][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:26:28,531][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:26:28,852][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:26:29,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:26:29,838][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:26:30,586][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:26:30,588][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:26:30,590][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:26:31,237][__main__][INFO] - Iteration 106 took 27s (11.69% Gen, 85.95% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 45m 18s. Estimated total time: 7h 36m 44s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 22s. [2026-03-25 16:26:31,239][__main__][INFO] - Starting iteration 106. [2026-03-25 16:26:31,242][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:26:31,243][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:26:34,481][__main__][INFO] - Number of regex retries in iteration 106: 0 [2026-03-25 16:26:34,482][__main__][INFO] - agents played in iteration 106 are Alice, Bob [2026-03-25 16:26:35,079][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:26:35,751][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:26:36,042][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:26:36,363][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:26:36,685][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:26:37,006][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:26:37,328][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:26:37,650][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:26:37,971][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:26:38,292][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:26:38,613][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:26:38,934][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:26:39,256][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:26:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:26:39,898][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:26:40,220][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:26:40,542][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:26:40,863][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:26:41,183][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:26:41,505][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:26:41,827][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:26:42,149][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:26:42,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:26:42,793][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:26:43,114][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:26:43,435][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:26:43,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:26:44,079][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:26:44,401][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:26:44,723][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:26:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:26:45,366][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:26:45,688][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:26:46,011][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:26:46,332][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:26:46,655][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:26:46,976][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:26:47,297][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:26:47,618][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:26:47,940][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:26:48,262][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:26:48,585][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:26:48,907][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:26:49,229][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:26:49,550][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:26:49,871][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:26:50,191][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:26:50,512][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:26:50,833][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:26:51,154][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:26:51,475][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:26:51,797][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:26:52,119][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:26:52,740][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:26:53,061][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:26:53,383][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:26:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:26:54,027][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:26:54,349][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:26:54,671][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:26:54,993][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:26:55,314][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:26:55,635][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:26:55,958][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:26:56,280][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:26:56,602][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:26:57,270][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:26:58,018][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:26:58,021][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:26:58,022][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:26:58,671][__main__][INFO] - Iteration 107 took 27s (11.81% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 45m 15s. Estimated total time: 7h 37m 9s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 42s, 500 more iterations: 3h 48m 34s. [2026-03-25 16:26:58,673][__main__][INFO] - Starting iteration 107. [2026-03-25 16:26:58,676][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:26:58,677][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:27:01,898][__main__][INFO] - Number of regex retries in iteration 107: 0 [2026-03-25 16:27:01,899][__main__][INFO] - agents played in iteration 107 are Alice, Bob [2026-03-25 16:27:02,487][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:27:03,160][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:27:03,452][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:27:03,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:27:04,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:27:04,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:27:04,738][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:27:05,060][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:27:05,382][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:27:05,703][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:27:06,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:27:06,347][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:27:06,669][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:27:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:27:07,312][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:27:07,634][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:27:07,955][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:27:08,277][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:27:08,598][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:27:08,919][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:27:09,241][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:27:09,562][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:27:09,883][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:27:10,203][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:27:10,524][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:27:10,845][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:27:11,166][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:27:11,487][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:27:11,809][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:27:12,131][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:27:12,452][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:27:12,774][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:27:13,095][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:27:13,417][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:27:13,739][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:27:14,060][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:27:14,382][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:27:14,703][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:27:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:27:15,348][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:27:15,670][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:27:15,992][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:27:16,313][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:27:16,635][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:27:16,956][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:27:17,278][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:27:17,600][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:27:17,921][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:27:18,243][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:27:18,565][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:27:18,886][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:27:19,207][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:27:19,528][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:27:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:27:20,473][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:27:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:27:21,115][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:27:21,436][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:27:21,757][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:27:22,078][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:27:22,400][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:27:22,721][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:27:23,043][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:27:23,364][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:27:23,686][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:27:24,008][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:27:24,673][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:27:25,422][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:27:25,424][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:27:25,426][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:27:26,074][__main__][INFO] - Iteration 108 took 27s (11.76% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 44m 16s. Estimated total time: 7h 36m 38s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 19s. [2026-03-25 16:27:26,076][__main__][INFO] - Starting iteration 108. [2026-03-25 16:27:26,080][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:27:26,080][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:27:29,300][__main__][INFO] - Number of regex retries in iteration 108: 0 [2026-03-25 16:27:29,301][__main__][INFO] - agents played in iteration 108 are Alice, Bob [2026-03-25 16:27:29,901][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:27:30,573][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:27:30,865][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:27:31,188][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:27:31,511][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:27:31,834][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:27:32,156][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:27:32,479][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:27:32,801][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:27:33,123][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:27:33,445][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:27:33,767][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:27:34,090][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:27:34,412][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:27:34,734][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:27:35,057][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:27:35,380][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:27:35,702][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:27:36,023][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:27:36,345][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:27:36,667][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:27:36,990][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:27:37,312][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:27:37,634][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:27:37,956][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:27:38,278][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:27:38,600][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:27:38,923][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:27:39,245][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:27:39,566][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:27:39,888][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:27:40,211][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:27:40,533][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:27:40,855][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:27:41,178][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:27:41,500][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:27:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:27:42,146][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:27:42,469][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:27:42,792][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:27:43,114][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:27:43,436][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:27:43,758][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:27:44,080][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:27:44,403][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:27:44,724][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:27:45,046][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:27:45,368][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:27:45,688][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:27:46,010][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:27:46,331][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:27:46,653][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:27:46,974][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:27:47,596][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:27:47,917][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:27:48,239][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:27:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:27:48,883][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:27:49,204][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:27:49,525][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:27:49,845][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:27:50,167][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:27:50,489][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:27:50,811][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:27:51,132][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:27:51,454][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:27:52,119][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:27:52,876][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:27:52,878][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:27:52,879][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:27:53,528][__main__][INFO] - Iteration 109 took 27s (11.73% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 44m 40s. Estimated total time: 7h 37m 29s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 44s. [2026-03-25 16:27:53,530][__main__][INFO] - Starting iteration 109. [2026-03-25 16:27:53,534][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:27:53,535][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:27:56,762][__main__][INFO] - Number of regex retries in iteration 109: 0 [2026-03-25 16:27:56,763][__main__][INFO] - agents played in iteration 109 are Alice, Bob [2026-03-25 16:27:57,359][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:27:58,031][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:27:58,322][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:27:58,644][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:27:58,965][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:27:59,287][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:27:59,609][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:27:59,931][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:28:00,253][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:28:00,574][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:28:00,895][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:28:01,217][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:28:01,538][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:28:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:28:02,181][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:28:02,502][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:28:02,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:28:03,146][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:28:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:28:03,790][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:28:04,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:28:04,434][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:28:04,755][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:28:05,076][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:28:05,398][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:28:05,720][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:28:06,041][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:28:06,362][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:28:06,684][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:28:07,005][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:28:07,327][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:28:07,649][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:28:07,970][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:28:08,292][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:28:08,614][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:28:08,935][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:28:09,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:28:09,578][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:28:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:28:10,220][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:28:10,541][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:28:10,863][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:28:11,185][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:28:11,506][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:28:11,828][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:28:12,149][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:28:12,470][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:28:12,792][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:28:13,114][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:28:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:28:13,756][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:28:14,078][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:28:14,398][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:28:15,022][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:28:15,343][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:28:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:28:15,986][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:28:16,307][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:28:16,629][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:28:16,951][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:28:17,272][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:28:17,594][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:28:17,916][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:28:18,237][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:28:18,559][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:28:18,881][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:28:19,546][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:28:20,289][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:28:20,291][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:28:20,293][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:28:20,941][__main__][INFO] - Iteration 110 took 27s (11.78% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 43m 31s. Estimated total time: 7h 36m 47s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 23s. [2026-03-25 16:28:20,943][__main__][INFO] - Starting iteration 110. [2026-03-25 16:28:20,946][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:28:20,946][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:28:24,157][__main__][INFO] - Number of regex retries in iteration 110: 0 [2026-03-25 16:28:24,158][__main__][INFO] - agents played in iteration 110 are Alice, Bob [2026-03-25 16:28:24,747][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:28:25,419][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:28:25,710][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:28:26,032][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:28:26,354][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:28:26,676][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:28:26,997][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:28:27,318][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:28:27,639][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:28:27,962][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:28:28,284][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:28:28,605][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:28:28,927][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:28:29,248][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:28:29,570][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:28:29,891][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:28:30,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:28:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:28:30,856][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:28:31,178][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:28:31,500][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:28:31,822][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:28:32,143][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:28:32,465][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:28:32,786][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:28:33,109][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:28:33,431][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:28:33,752][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:28:34,074][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:28:34,396][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:28:34,718][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:28:35,039][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:28:35,360][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:28:35,682][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:28:36,004][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:28:36,327][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:28:36,649][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:28:36,970][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:28:37,292][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:28:37,613][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:28:37,934][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:28:38,258][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:28:38,579][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:28:38,901][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:28:39,223][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:28:39,544][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:28:39,865][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:28:40,187][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:28:40,509][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:28:40,831][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:28:41,153][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:28:41,475][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:28:41,796][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:28:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:28:42,740][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:28:43,062][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:28:43,383][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:28:43,704][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:28:44,026][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:28:44,348][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:28:44,669][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:28:44,991][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:28:45,313][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:28:45,634][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:28:45,956][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:28:46,279][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:28:46,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:28:47,701][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:28:47,703][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:28:47,705][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:28:48,354][__main__][INFO] - Iteration 111 took 27s (11.72% Gen, 85.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 43m 5s. Estimated total time: 7h 36m 49s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 24s. [2026-03-25 16:28:48,356][__main__][INFO] - Starting iteration 111. [2026-03-25 16:28:48,359][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:28:48,360][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:28:51,615][__main__][INFO] - Number of regex retries in iteration 111: 0 [2026-03-25 16:28:51,616][__main__][INFO] - agents played in iteration 111 are Alice, Bob [2026-03-25 16:28:52,214][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:28:52,884][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:28:53,177][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:28:53,500][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:28:53,821][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:28:54,143][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:28:54,466][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:28:54,787][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:28:55,109][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:28:55,432][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:28:55,755][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:28:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:28:56,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:28:56,722][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:28:57,044][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:28:57,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:28:57,690][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:28:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:28:58,336][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:28:58,658][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:28:58,981][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:28:59,304][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:28:59,626][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:28:59,949][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:29:00,273][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:29:00,595][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:29:00,918][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:29:01,241][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:29:01,563][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:29:01,886][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:29:02,207][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:29:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:29:02,850][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:29:03,173][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:29:03,495][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:29:03,818][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:29:04,141][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:29:04,464][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:29:04,785][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:29:05,106][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:29:05,429][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:29:05,751][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:29:06,072][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:29:06,394][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:29:06,716][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:29:07,037][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:29:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:29:07,681][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:29:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:29:08,323][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:29:08,644][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:29:08,966][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:29:09,287][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:29:09,905][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:29:10,226][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:29:10,547][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:29:10,869][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:29:11,191][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:29:11,512][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:29:11,834][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:29:12,155][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:29:12,476][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:29:12,798][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:29:13,119][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:29:13,440][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:29:13,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:29:14,421][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:29:15,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:29:15,170][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:29:15,172][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:29:15,813][__main__][INFO] - Iteration 112 took 27s (11.86% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 43m 23s. Estimated total time: 7h 37m 34s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 45s, 500 more iterations: 3h 48m 47s. [2026-03-25 16:29:15,815][__main__][INFO] - Starting iteration 112. [2026-03-25 16:29:15,818][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:29:15,818][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:29:19,035][__main__][INFO] - Number of regex retries in iteration 112: 0 [2026-03-25 16:29:19,035][__main__][INFO] - agents played in iteration 112 are Alice, Bob [2026-03-25 16:29:19,618][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:29:20,281][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:29:20,572][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:29:20,895][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:29:21,217][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:29:21,537][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:29:21,859][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:29:22,182][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:29:22,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:29:22,825][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:29:23,147][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:29:23,469][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:29:23,790][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:29:24,112][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:29:24,433][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:29:24,754][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:29:25,075][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:29:25,396][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:29:25,718][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:29:26,040][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:29:26,361][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:29:26,683][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:29:27,004][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:29:27,325][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:29:27,647][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:29:27,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:29:28,292][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:29:28,613][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:29:28,935][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:29:29,256][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:29:29,578][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:29:29,901][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:29:30,223][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:29:30,545][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:29:30,866][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:29:31,188][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:29:31,510][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:29:31,831][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:29:32,153][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:29:32,474][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:29:32,796][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:29:33,118][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:29:33,439][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:29:33,761][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:29:34,083][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:29:34,405][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:29:34,726][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:29:35,049][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:29:35,371][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:29:35,692][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:29:36,014][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:29:36,336][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:29:36,658][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:29:37,276][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:29:37,598][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:29:37,920][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:29:38,242][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:29:38,563][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:29:38,885][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:29:39,207][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:29:39,529][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:29:39,851][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:29:40,173][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:29:40,495][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:29:40,817][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:29:41,138][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:29:41,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:29:42,552][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:29:42,554][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:29:42,556][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:29:43,193][__main__][INFO] - Iteration 113 took 27s (11.75% Gen, 85.92% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 41m 37s. Estimated total time: 7h 36m 16s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 8s. [2026-03-25 16:29:43,195][__main__][INFO] - Starting iteration 113. [2026-03-25 16:29:43,199][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:29:43,199][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:29:46,420][__main__][INFO] - Number of regex retries in iteration 113: 0 [2026-03-25 16:29:46,420][__main__][INFO] - agents played in iteration 113 are Alice, Bob [2026-03-25 16:29:47,023][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:29:47,686][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:29:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:29:48,300][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:29:48,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:29:48,942][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:29:49,262][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:29:49,583][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:29:49,904][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:29:50,225][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:29:50,546][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:29:50,867][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:29:51,189][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:29:51,510][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:29:51,832][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:29:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:29:52,476][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:29:52,797][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:29:53,118][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:29:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:29:53,759][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:29:54,081][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:29:54,402][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:29:54,723][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:29:55,045][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:29:55,367][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:29:55,689][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:29:56,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:29:56,332][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:29:56,653][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:29:56,974][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:29:57,298][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:29:57,619][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:29:57,940][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:29:58,261][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:29:58,583][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:29:58,904][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:29:59,226][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:29:59,548][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:29:59,870][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:30:00,192][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:30:00,514][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:30:00,836][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:30:01,156][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:30:01,477][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:30:01,799][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:30:02,122][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:30:02,444][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:30:02,765][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:30:03,086][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:30:03,407][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:30:03,729][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:30:04,050][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:30:04,668][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:30:04,991][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:30:05,311][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:30:05,634][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:30:05,955][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:30:06,276][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:30:06,596][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:30:06,918][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:30:07,240][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:30:07,562][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:30:07,884][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:30:08,206][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:30:08,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:30:09,185][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:30:09,941][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:30:09,944][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:30:09,945][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:30:10,586][__main__][INFO] - Iteration 114 took 27s (11.76% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 41m 22s. Estimated total time: 7h 36m 28s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 14s. [2026-03-25 16:30:10,588][__main__][INFO] - Starting iteration 114. [2026-03-25 16:30:10,591][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:30:10,592][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:30:13,830][__main__][INFO] - Number of regex retries in iteration 114: 0 [2026-03-25 16:30:13,831][__main__][INFO] - agents played in iteration 114 are Alice, Bob [2026-03-25 16:30:14,430][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:30:15,100][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:30:15,392][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:30:15,714][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:30:16,037][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:30:16,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:30:16,681][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:30:17,004][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:30:17,326][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:30:17,648][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:30:17,972][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:30:18,296][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:30:18,619][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:30:18,943][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:30:19,265][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:30:19,587][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:30:19,909][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:30:20,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:30:20,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:30:20,878][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:30:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:30:21,525][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:30:21,847][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:30:22,169][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:30:22,492][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:30:22,814][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:30:23,137][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:30:23,460][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:30:23,783][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:30:24,105][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:30:24,427][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:30:24,747][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:30:25,068][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:30:25,390][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:30:25,712][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:30:26,033][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:30:26,354][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:30:26,676][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:30:26,996][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:30:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:30:27,639][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:30:27,960][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:30:28,281][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:30:28,603][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:30:28,924][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:30:29,246][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:30:29,567][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:30:29,889][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:30:30,211][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:30:30,532][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:30:30,852][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:30:31,174][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:30:31,495][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:30:32,113][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:30:32,435][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:30:32,757][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:30:33,079][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:30:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:30:33,722][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:30:34,044][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:30:34,365][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:30:34,686][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:30:35,007][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:30:35,330][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:30:35,650][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:30:35,972][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:30:36,629][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:30:37,374][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:30:37,376][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:30:37,378][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:30:38,018][__main__][INFO] - Iteration 115 took 27s (11.81% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 41m 34s. Estimated total time: 7h 37m 8s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 42s, 500 more iterations: 3h 48m 34s. [2026-03-25 16:30:38,021][__main__][INFO] - Starting iteration 115. [2026-03-25 16:30:38,024][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:30:38,025][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:30:41,249][__main__][INFO] - Number of regex retries in iteration 115: 0 [2026-03-25 16:30:41,249][__main__][INFO] - agents played in iteration 115 are Alice, Bob [2026-03-25 16:30:41,832][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:30:42,495][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:30:42,787][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:30:43,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:30:43,431][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:30:43,752][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:30:44,073][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:30:44,394][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:30:44,715][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:30:45,037][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:30:45,358][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:30:45,679][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:30:46,000][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:30:46,322][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:30:46,644][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:30:46,966][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:30:47,288][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:30:47,609][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:30:47,930][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:30:48,252][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:30:48,573][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:30:48,895][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:30:49,217][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:30:49,538][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:30:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:30:50,180][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:30:50,502][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:30:50,823][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:30:51,144][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:30:51,466][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:30:51,787][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:30:52,110][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:30:52,431][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:30:52,752][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:30:53,074][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:30:53,395][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:30:53,716][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:30:54,037][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:30:54,359][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:30:54,680][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:30:55,001][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:30:55,322][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:30:55,643][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:30:55,965][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:30:56,287][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:30:56,608][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:30:56,930][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:30:57,251][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:30:57,573][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:30:57,894][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:30:58,216][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:30:58,538][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:30:58,860][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:30:59,481][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:30:59,803][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:31:00,125][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:31:00,447][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:31:00,769][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:31:01,091][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:31:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:31:01,734][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:31:02,055][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:31:02,377][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:31:02,698][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:31:03,019][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:31:03,339][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:31:04,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:31:04,750][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:31:04,752][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:31:04,754][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:31:05,402][__main__][INFO] - Iteration 116 took 27s (11.78% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 40m 18s. Estimated total time: 7h 36m 19s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 9s. [2026-03-25 16:31:05,404][__main__][INFO] - Starting iteration 116. [2026-03-25 16:31:05,408][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:31:05,409][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:31:08,691][__main__][INFO] - Number of regex retries in iteration 116: 0 [2026-03-25 16:31:08,692][__main__][INFO] - agents played in iteration 116 are Alice, Bob [2026-03-25 16:31:09,348][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:31:10,019][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:31:10,311][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:31:10,634][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:31:10,955][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:31:11,277][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:31:11,598][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:31:11,919][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:31:12,241][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:31:12,562][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:31:12,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:31:13,205][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:31:13,526][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:31:13,847][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:31:14,170][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:31:14,492][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:31:14,815][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:31:15,136][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:31:15,458][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:31:15,778][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:31:16,100][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:31:16,423][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:31:16,745][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:31:17,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:31:17,388][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:31:17,710][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:31:18,031][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:31:18,353][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:31:18,674][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:31:18,997][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:31:19,318][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:31:19,640][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:31:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:31:20,283][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:31:20,604][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:31:20,926][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:31:21,247][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:31:21,569][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:31:21,891][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:31:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:31:22,535][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:31:22,856][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:31:23,178][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:31:23,499][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:31:23,821][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:31:24,141][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:31:24,464][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:31:24,786][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:31:25,107][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:31:25,429][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:31:25,751][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:31:26,072][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:31:26,394][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:31:27,016][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:31:27,338][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:31:27,660][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:31:27,981][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:31:28,302][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:31:28,624][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:31:28,947][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:31:29,269][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:31:29,591][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:31:29,913][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:31:30,235][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:31:30,557][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:31:30,880][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:31:31,551][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:31:32,300][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:31:32,302][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:31:32,304][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:31:32,951][__main__][INFO] - Iteration 117 took 27s (11.92% Gen, 85.72% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 42m 36s. Estimated total time: 7h 39m 4s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 54s, 500 more iterations: 3h 49m 32s. [2026-03-25 16:31:32,954][__main__][INFO] - Starting iteration 117. [2026-03-25 16:31:32,957][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:31:32,957][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:31:36,217][__main__][INFO] - Number of regex retries in iteration 117: 0 [2026-03-25 16:31:36,218][__main__][INFO] - agents played in iteration 117 are Alice, Bob [2026-03-25 16:31:36,835][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:31:37,513][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:31:37,805][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:31:38,128][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:31:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:31:38,771][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:31:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:31:39,415][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:31:39,738][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:31:40,059][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:31:40,382][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:31:40,706][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:31:41,028][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:31:41,351][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:31:41,673][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:31:41,995][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:31:42,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:31:42,640][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:31:42,963][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:31:43,286][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:31:43,609][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:31:43,931][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:31:44,253][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:31:44,574][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:31:44,895][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:31:45,216][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:31:45,538][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:31:45,859][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:31:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:31:46,503][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:31:46,824][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:31:47,146][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:31:47,467][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:31:47,789][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:31:48,110][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:31:48,431][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:31:48,752][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:31:49,074][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:31:49,396][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:31:49,718][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:31:50,040][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:31:50,362][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:31:50,684][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:31:51,005][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:31:51,326][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:31:51,647][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:31:51,969][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:31:52,291][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:31:52,613][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:31:52,935][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:31:53,257][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:31:53,579][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:31:53,901][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:31:54,523][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:31:54,844][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:31:55,165][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:31:55,487][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:31:55,810][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:31:56,131][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:31:56,454][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:31:56,776][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:31:57,096][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:31:57,418][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:31:57,740][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:31:58,062][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:31:58,383][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:31:59,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:31:59,796][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:31:59,799][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:31:59,800][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:32:00,451][__main__][INFO] - Iteration 118 took 27s (11.86% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 41m 19s. Estimated total time: 7h 38m 15s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 49s, 500 more iterations: 3h 49m 7s. [2026-03-25 16:32:00,453][__main__][INFO] - Starting iteration 118. [2026-03-25 16:32:00,456][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:32:00,457][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:32:03,725][__main__][INFO] - Number of regex retries in iteration 118: 0 [2026-03-25 16:32:03,726][__main__][INFO] - agents played in iteration 118 are Alice, Bob [2026-03-25 16:32:04,309][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:32:04,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:32:05,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:32:05,595][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:32:05,918][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:32:06,239][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:32:06,561][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:32:06,882][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:32:07,203][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:32:07,524][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:32:07,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:32:08,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:32:08,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:32:08,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:32:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:32:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:32:09,775][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:32:10,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:32:10,416][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:32:10,737][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:32:11,058][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:32:11,379][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:32:11,701][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:32:12,023][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:32:12,344][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:32:12,665][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:32:12,987][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:32:13,309][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:32:13,630][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:32:13,952][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:32:14,274][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:32:14,596][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:32:14,919][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:32:15,241][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:32:15,562][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:32:15,883][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:32:16,204][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:32:16,525][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:32:16,846][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:32:17,166][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:32:17,487][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:32:17,809][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:32:18,130][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:32:18,452][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:32:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:32:19,094][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:32:19,416][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:32:19,738][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:32:20,059][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:32:20,381][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:32:20,703][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:32:21,024][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:32:21,346][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:32:21,969][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:32:22,290][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:32:22,611][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:32:22,932][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:32:23,254][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:32:23,576][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:32:23,897][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:32:24,219][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:32:24,540][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:32:24,861][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:32:25,181][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:32:25,502][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:32:25,824][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:32:26,490][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:32:27,243][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:32:27,245][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:32:27,247][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:32:27,895][__main__][INFO] - Iteration 119 took 27s (11.91% Gen, 85.72% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 39m 56s. Estimated total time: 7h 37m 20s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 40s. [2026-03-25 16:32:27,897][__main__][INFO] - Starting iteration 119. [2026-03-25 16:32:27,900][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:32:27,901][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:32:31,162][__main__][INFO] - Number of regex retries in iteration 119: 0 [2026-03-25 16:32:31,163][__main__][INFO] - agents played in iteration 119 are Alice, Bob [2026-03-25 16:32:31,790][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:32:32,464][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:32:32,756][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:32:33,077][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:32:33,400][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:32:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:32:34,042][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:32:34,364][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:32:34,686][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:32:35,007][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:32:35,328][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:32:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:32:35,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:32:36,292][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:32:36,614][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:32:36,934][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:32:37,255][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:32:37,577][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:32:37,898][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:32:38,220][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:32:38,541][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:32:38,863][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:32:39,184][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:32:39,504][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:32:39,826][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:32:40,147][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:32:40,468][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:32:40,789][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:32:41,112][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:32:41,434][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:32:41,755][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:32:42,077][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:32:42,398][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:32:42,720][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:32:43,040][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:32:43,361][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:32:43,682][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:32:44,003][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:32:44,324][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:32:44,645][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:32:44,966][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:32:45,287][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:32:45,610][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:32:45,932][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:32:46,253][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:32:46,574][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:32:46,896][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:32:47,217][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:32:47,540][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:32:47,862][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:32:48,183][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:32:48,506][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:32:48,828][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:32:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:32:49,774][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:32:50,095][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:32:50,417][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:32:50,739][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:32:51,061][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:32:51,383][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:32:51,705][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:32:52,027][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:32:52,350][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:32:52,672][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:32:52,994][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:32:53,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:32:53,991][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:32:54,741][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:32:54,743][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:32:54,745][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:32:55,395][__main__][INFO] - Iteration 120 took 27s (11.86% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 40m 24s. Estimated total time: 7h 38m 15s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 49s, 500 more iterations: 3h 49m 7s. [2026-03-25 16:32:55,397][__main__][INFO] - Starting iteration 120. [2026-03-25 16:32:55,400][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:32:55,401][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:32:58,651][__main__][INFO] - Number of regex retries in iteration 120: 0 [2026-03-25 16:32:58,651][__main__][INFO] - agents played in iteration 120 are Alice, Bob [2026-03-25 16:32:59,291][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:32:59,992][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:33:00,284][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:33:00,606][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:33:00,928][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:33:01,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:33:01,572][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:33:01,894][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:33:02,216][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:33:02,537][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:33:02,859][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:33:03,180][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:33:03,501][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:33:03,823][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:33:04,145][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:33:04,467][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:33:04,789][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:33:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:33:05,432][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:33:05,753][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:33:06,074][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:33:06,395][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:33:06,716][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:33:07,037][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:33:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:33:07,680][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:33:08,002][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:33:08,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:33:08,646][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:33:08,967][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:33:09,288][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:33:09,609][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:33:09,932][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:33:10,253][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:33:10,575][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:33:10,896][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:33:11,217][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:33:11,539][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:33:11,860][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:33:12,182][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:33:12,503][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:33:12,825][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:33:13,147][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:33:13,468][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:33:13,789][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:33:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:33:14,432][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:33:14,754][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:33:15,075][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:33:15,397][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:33:15,717][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:33:16,039][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:33:16,360][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:33:16,981][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:33:17,303][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:33:17,624][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:33:17,945][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:33:18,266][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:33:18,588][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:33:18,909][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:33:19,231][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:33:19,553][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:33:19,874][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:33:20,196][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:33:20,517][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:33:20,838][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:33:21,504][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:33:22,253][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:33:22,256][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:33:22,258][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:33:22,903][__main__][INFO] - Iteration 121 took 27s (11.82% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 40m 5s. Estimated total time: 7h 38m 24s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 50s, 500 more iterations: 3h 49m 12s. [2026-03-25 16:33:22,905][__main__][INFO] - Starting iteration 121. [2026-03-25 16:33:22,909][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:33:22,909][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:33:26,099][__main__][INFO] - Number of regex retries in iteration 121: 0 [2026-03-25 16:33:26,100][__main__][INFO] - agents played in iteration 121 are Alice, Bob [2026-03-25 16:33:26,686][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:33:27,687][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:33:27,978][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:33:28,300][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:33:28,621][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:33:28,942][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:33:29,264][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:33:29,586][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:33:29,908][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:33:30,229][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:33:30,550][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:33:30,872][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:33:31,194][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:33:31,515][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:33:31,837][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:33:32,159][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:33:32,480][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:33:32,801][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:33:33,122][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:33:33,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:33:33,763][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:33:34,086][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:33:34,408][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:33:34,728][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:33:35,049][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:33:35,370][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:33:35,692][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:33:36,013][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:33:36,334][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:33:36,657][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:33:36,978][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:33:37,299][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:33:37,622][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:33:37,942][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:33:38,264][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:33:38,586][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:33:38,908][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:33:39,231][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:33:39,553][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:33:39,874][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:33:40,196][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:33:40,518][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:33:40,839][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:33:41,161][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:33:41,483][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:33:41,804][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:33:42,125][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:33:42,448][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:33:42,770][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:33:43,091][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:33:43,413][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:33:43,734][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:33:44,055][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:33:44,676][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:33:44,997][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:33:45,319][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:33:45,640][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:33:45,962][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:33:46,283][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:33:46,604][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:33:46,925][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:33:47,246][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:33:47,567][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:33:47,888][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:33:48,210][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:33:48,532][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:33:49,197][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:33:49,949][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:33:49,951][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:33:49,953][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:33:50,602][__main__][INFO] - Iteration 122 took 27s (11.52% Gen, 86.13% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 42m 48s. Estimated total time: 7h 41m 34s. Time estimates for 10 more iterations: 4m 36s, 100 more iterations: 46m 9s, 500 more iterations: 3h 50m 47s. [2026-03-25 16:33:50,605][__main__][INFO] - Starting iteration 122. [2026-03-25 16:33:50,608][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:33:50,609][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:33:53,830][__main__][INFO] - Number of regex retries in iteration 122: 0 [2026-03-25 16:33:53,831][__main__][INFO] - agents played in iteration 122 are Alice, Bob [2026-03-25 16:33:54,464][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:33:55,135][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:33:55,427][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:33:55,749][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:33:56,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:33:56,392][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:33:56,713][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:33:57,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:33:57,357][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:33:57,679][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:33:58,001][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:33:58,322][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:33:58,643][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:33:58,964][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:33:59,284][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:33:59,605][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:33:59,926][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:34:00,247][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:34:00,568][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:34:00,889][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:34:01,211][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:34:01,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:34:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:34:02,176][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:34:02,497][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:34:02,819][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:34:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:34:03,463][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:34:03,785][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:34:04,107][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:34:04,428][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:34:04,749][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:34:05,071][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:34:05,393][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:34:05,715][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:34:06,036][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:34:06,357][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:34:06,679][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:34:07,000][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:34:07,321][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:34:07,642][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:34:07,964][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:34:08,284][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:34:08,606][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:34:08,928][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:34:09,249][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:34:09,570][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:34:09,892][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:34:10,214][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:34:10,535][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:34:10,858][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:34:11,180][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:34:11,502][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:34:12,127][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:34:12,449][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:34:12,771][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:34:13,093][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:34:13,415][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:34:13,736][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:34:14,058][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:34:14,380][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:34:14,703][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:34:15,027][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:34:15,349][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:34:15,671][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:34:15,992][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:34:16,664][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:34:17,415][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:34:17,418][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:34:17,419][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:34:18,066][__main__][INFO] - Iteration 123 took 27s (11.74% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 38m 25s. Estimated total time: 7h 37m 39s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 45s, 500 more iterations: 3h 48m 49s. [2026-03-25 16:34:18,069][__main__][INFO] - Starting iteration 123. [2026-03-25 16:34:18,072][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:34:18,072][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:34:21,326][__main__][INFO] - Number of regex retries in iteration 123: 0 [2026-03-25 16:34:21,327][__main__][INFO] - agents played in iteration 123 are Alice, Bob [2026-03-25 16:34:21,929][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:34:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:34:22,893][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:34:23,216][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:34:23,538][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:34:23,860][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:34:24,181][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:34:24,503][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:34:24,824][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:34:25,145][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:34:25,466][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:34:25,786][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:34:26,109][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:34:26,430][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:34:26,752][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:34:27,074][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:34:27,396][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:34:27,717][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:34:28,039][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:34:28,360][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:34:28,681][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:34:29,003][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:34:29,325][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:34:29,645][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:34:29,966][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:34:30,287][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:34:30,607][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:34:30,929][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:34:31,251][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:34:31,572][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:34:31,894][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:34:32,216][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:34:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:34:32,859][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:34:33,179][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:34:33,501][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:34:33,822][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:34:34,143][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:34:34,465][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:34:34,787][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:34:35,108][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:34:35,429][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:34:35,749][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:34:36,070][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:34:36,391][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:34:36,713][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:34:37,034][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:34:37,355][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:34:37,677][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:34:37,998][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:34:38,320][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:34:38,640][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:34:38,961][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:34:39,583][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:34:39,904][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:34:40,226][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:34:40,548][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:34:40,869][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:34:41,190][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:34:41,511][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:34:41,833][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:34:42,155][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:34:42,478][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:34:42,800][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:34:43,121][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:34:43,442][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:34:44,109][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:34:44,864][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:34:44,871][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:34:44,873][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:34:45,521][__main__][INFO] - Iteration 124 took 27s (11.86% Gen, 85.78% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 37m 48s. Estimated total time: 7h 37m 30s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 45s, 500 more iterations: 3h 48m 45s. [2026-03-25 16:34:45,523][__main__][INFO] - Starting iteration 124. [2026-03-25 16:34:45,526][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:34:45,527][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:34:48,749][__main__][INFO] - Number of regex retries in iteration 124: 0 [2026-03-25 16:34:48,750][__main__][INFO] - agents played in iteration 124 are Alice, Bob [2026-03-25 16:34:49,346][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:34:50,017][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:34:50,309][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:34:50,630][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:34:50,951][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:34:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:34:51,593][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:34:51,914][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:34:52,236][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:34:52,557][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:34:52,878][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:34:53,199][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:34:53,520][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:34:53,841][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:34:54,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:34:54,482][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:34:54,803][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:34:55,124][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:34:55,445][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:34:55,766][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:34:56,086][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:34:56,408][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:34:56,728][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:34:57,051][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:34:57,372][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:34:57,693][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:34:58,014][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:34:58,336][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:34:58,659][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:34:58,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:34:59,300][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:34:59,621][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:34:59,942][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:35:00,263][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:35:00,583][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:35:00,904][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:35:01,225][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:35:01,547][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:35:01,868][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:35:02,190][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:35:02,512][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:35:02,833][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:35:03,154][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:35:03,476][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:35:03,797][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:35:04,118][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:35:04,439][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:35:04,761][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:35:05,084][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:35:05,405][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:35:05,727][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:35:06,049][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:35:06,371][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:35:06,994][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:35:07,315][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:35:07,636][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:35:07,958][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:35:08,278][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:35:08,599][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:35:08,920][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:35:09,241][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:35:09,562][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:35:09,883][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:35:10,204][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:35:10,526][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:35:10,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:35:11,513][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:35:12,260][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:35:12,262][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:35:12,264][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:35:12,882][__main__][INFO] - Iteration 125 took 27s (11.78% Gen, 85.95% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 35m 48s. Estimated total time: 7h 35m 56s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 35s, 500 more iterations: 3h 47m 58s. [2026-03-25 16:35:12,885][__main__][INFO] - Starting iteration 125. [2026-03-25 16:35:12,888][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:35:12,888][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:35:16,140][__main__][INFO] - Number of regex retries in iteration 125: 0 [2026-03-25 16:35:16,141][__main__][INFO] - agents played in iteration 125 are Alice, Bob [2026-03-25 16:35:16,724][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:35:17,395][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:35:17,688][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:35:18,011][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:35:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:35:18,654][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:35:18,976][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:35:19,297][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:35:19,619][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:35:19,940][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:35:20,262][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:35:20,584][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:35:20,904][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:35:21,225][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:35:21,547][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:35:21,867][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:35:22,189][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:35:22,510][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:35:22,831][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:35:23,153][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:35:23,475][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:35:23,796][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:35:24,117][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:35:24,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:35:24,760][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:35:25,081][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:35:25,403][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:35:25,725][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:35:26,046][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:35:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:35:26,688][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:35:27,010][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:35:27,331][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:35:27,653][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:35:27,975][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:35:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:35:28,619][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:35:28,942][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:35:29,264][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:35:29,586][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:35:29,908][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:35:30,230][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:35:30,553][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:35:30,874][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:35:31,196][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:35:31,518][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:35:31,841][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:35:32,163][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:35:32,485][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:35:32,807][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:35:33,129][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:35:33,451][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:35:33,773][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:35:34,400][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:35:34,722][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:35:35,044][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:35:35,366][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:35:35,689][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:35:36,011][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:35:36,334][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:35:36,656][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:35:36,979][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:35:37,301][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:35:37,624][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:35:37,946][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:35:38,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:35:38,942][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:35:39,699][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:35:39,701][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:35:39,703][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:35:40,352][__main__][INFO] - Iteration 126 took 27s (11.84% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 37m 9s. Estimated total time: 7h 37m 45s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 46s, 500 more iterations: 3h 48m 52s. [2026-03-25 16:35:40,354][__main__][INFO] - Starting iteration 126. [2026-03-25 16:35:40,357][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:35:40,358][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:35:43,580][__main__][INFO] - Number of regex retries in iteration 126: 0 [2026-03-25 16:35:43,581][__main__][INFO] - agents played in iteration 126 are Alice, Bob [2026-03-25 16:35:44,139][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:35:44,811][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:35:45,102][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:35:45,424][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:35:45,745][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:35:46,067][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:35:46,389][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:35:46,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:35:47,032][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:35:47,354][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:35:47,675][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:35:47,996][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:35:48,318][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:35:48,641][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:35:48,961][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:35:49,283][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:35:49,603][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:35:49,924][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:35:50,246][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:35:50,567][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:35:50,889][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:35:51,211][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:35:51,533][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:35:51,854][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:35:52,176][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:35:52,498][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:35:52,818][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:35:53,140][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:35:53,461][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:35:53,782][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:35:54,103][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:35:54,423][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:35:54,744][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:35:55,065][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:35:55,385][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:35:55,706][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:35:56,027][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:35:56,348][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:35:56,670][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:35:56,991][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:35:57,313][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:35:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:35:57,956][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:35:58,277][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:35:58,598][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:35:58,919][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:35:59,240][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:35:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:35:59,884][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:36:00,205][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:36:00,527][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:36:00,848][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:36:01,170][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:36:01,792][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:36:02,113][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:36:02,434][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:36:02,756][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:36:03,079][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:36:03,400][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:36:03,722][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:36:04,043][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:36:04,364][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:36:04,685][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:36:05,005][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:36:05,326][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:36:05,648][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:36:06,312][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:36:07,070][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:36:07,072][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:36:07,074][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:36:07,724][__main__][INFO] - Iteration 127 took 27s (11.78% Gen, 85.84% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 35m 5s. Estimated total time: 7h 36m 8s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 36s, 500 more iterations: 3h 48m 4s. [2026-03-25 16:36:07,727][__main__][INFO] - Starting iteration 127. [2026-03-25 16:36:07,730][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:36:07,731][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:36:10,927][__main__][INFO] - Number of regex retries in iteration 127: 0 [2026-03-25 16:36:10,928][__main__][INFO] - agents played in iteration 127 are Alice, Bob [2026-03-25 16:36:11,486][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:36:12,157][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:36:12,449][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:36:12,771][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:36:13,093][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:36:13,415][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:36:13,737][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:36:14,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:36:14,379][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:36:14,700][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:36:15,022][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:36:15,343][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:36:15,665][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:36:15,986][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:36:16,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:36:16,630][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:36:16,951][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:36:17,273][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:36:17,593][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:36:17,915][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:36:18,236][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:36:18,557][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:36:18,878][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:36:19,199][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:36:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:36:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:36:20,163][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:36:20,484][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:36:20,805][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:36:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:36:21,448][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:36:21,771][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:36:22,094][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:36:22,415][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:36:22,736][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:36:23,057][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:36:23,377][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:36:23,698][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:36:24,020][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:36:24,342][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:36:24,663][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:36:24,986][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:36:25,307][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:36:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:36:25,951][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:36:26,272][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:36:26,593][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:36:26,914][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:36:27,237][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:36:27,559][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:36:27,880][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:36:28,201][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:36:28,522][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:36:29,144][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:36:29,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:36:29,789][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:36:30,111][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:36:30,432][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:36:30,754][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:36:31,075][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:36:31,396][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:36:31,718][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:36:32,040][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:36:32,361][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:36:32,682][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:36:33,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:36:33,662][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:36:34,404][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:36:34,406][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:36:34,408][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:36:35,056][__main__][INFO] - Iteration 128 took 27s (11.70% Gen, 85.92% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 33m 56s. Estimated total time: 7h 35m 26s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 32s, 500 more iterations: 3h 47m 43s. [2026-03-25 16:36:35,058][__main__][INFO] - Starting iteration 128. [2026-03-25 16:36:35,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:36:35,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:36:38,276][__main__][INFO] - Number of regex retries in iteration 128: 0 [2026-03-25 16:36:38,276][__main__][INFO] - agents played in iteration 128 are Alice, Bob [2026-03-25 16:36:38,847][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:36:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:36:39,811][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:36:40,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:36:40,455][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:36:40,778][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:36:41,100][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:36:41,421][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:36:41,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:36:42,063][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:36:42,385][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:36:42,706][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:36:43,027][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:36:43,349][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:36:43,672][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:36:43,994][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:36:44,316][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:36:44,639][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:36:44,960][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:36:45,281][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:36:45,603][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:36:45,924][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:36:46,246][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:36:46,567][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:36:46,890][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:36:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:36:47,534][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:36:47,857][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:36:48,180][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:36:48,502][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:36:48,824][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:36:49,146][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:36:49,467][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:36:49,790][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:36:50,112][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:36:50,435][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:36:50,757][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:36:51,078][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:36:51,401][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:36:51,723][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:36:52,046][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:36:52,368][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:36:52,690][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:36:53,012][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:36:53,334][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:36:53,656][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:36:53,979][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:36:54,301][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:36:54,625][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:36:54,947][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:36:55,270][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:36:55,593][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:36:55,915][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:36:56,541][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:36:56,863][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:36:57,185][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:36:57,509][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:36:57,833][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:36:58,155][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:36:58,477][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:36:58,798][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:36:59,119][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:36:59,441][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:36:59,762][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:37:00,083][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:37:00,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:37:01,069][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:37:01,819][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:37:01,821][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:37:01,823][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:37:02,470][__main__][INFO] - Iteration 129 took 27s (11.73% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 34m 52s. Estimated total time: 7h 36m 50s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 25s. [2026-03-25 16:37:02,472][__main__][INFO] - Starting iteration 129. [2026-03-25 16:37:02,475][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:37:02,476][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:37:05,699][__main__][INFO] - Number of regex retries in iteration 129: 0 [2026-03-25 16:37:05,699][__main__][INFO] - agents played in iteration 129 are Alice, Bob [2026-03-25 16:37:06,286][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:37:06,958][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:37:07,250][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:37:07,572][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:37:07,893][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:37:08,214][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:37:08,536][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:37:08,859][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:37:09,181][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:37:09,501][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:37:09,822][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:37:10,144][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:37:10,466][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:37:10,787][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:37:11,108][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:37:11,429][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:37:11,750][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:37:12,072][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:37:12,394][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:37:12,715][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:37:13,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:37:13,357][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:37:13,679][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:37:14,001][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:37:14,321][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:37:14,643][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:37:14,963][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:37:15,285][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:37:15,606][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:37:15,926][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:37:16,247][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:37:16,567][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:37:16,889][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:37:17,211][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:37:17,532][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:37:17,854][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:37:18,177][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:37:18,498][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:37:18,820][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:37:19,142][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:37:19,463][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:37:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:37:20,107][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:37:20,429][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:37:20,750][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:37:21,072][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:37:21,394][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:37:21,716][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:37:22,037][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:37:22,359][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:37:22,681][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:37:23,002][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:37:23,323][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:37:23,944][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:37:24,266][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:37:24,588][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:37:24,909][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:37:25,231][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:37:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:37:25,874][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:37:26,196][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:37:26,519][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:37:26,841][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:37:27,163][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:37:27,485][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:37:27,807][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:37:28,472][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:37:29,229][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:37:29,231][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:37:29,233][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:37:29,883][__main__][INFO] - Iteration 130 took 27s (11.76% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 34m 22s. Estimated total time: 7h 36m 48s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 24s. [2026-03-25 16:37:29,885][__main__][INFO] - Starting iteration 130. [2026-03-25 16:37:29,888][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:37:29,889][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:37:33,114][__main__][INFO] - Number of regex retries in iteration 130: 0 [2026-03-25 16:37:33,115][__main__][INFO] - agents played in iteration 130 are Alice, Bob [2026-03-25 16:37:33,676][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:37:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:37:34,652][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:37:34,975][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:37:35,296][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:37:35,617][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:37:35,939][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:37:36,261][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:37:36,582][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:37:36,903][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:37:37,224][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:37:37,545][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:37:37,867][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:37:38,188][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:37:38,509][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:37:38,831][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:37:39,153][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:37:39,473][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:37:39,794][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:37:40,115][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:37:40,437][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:37:40,758][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:37:41,080][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:37:41,401][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:37:41,722][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:37:42,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:37:42,366][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:37:42,688][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:37:43,011][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:37:43,334][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:37:43,656][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:37:43,977][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:37:44,298][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:37:44,620][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:37:44,940][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:37:45,261][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:37:45,582][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:37:45,904][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:37:46,225][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:37:46,547][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:37:46,868][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:37:47,190][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:37:47,512][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:37:47,832][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:37:48,154][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:37:48,475][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:37:48,797][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:37:49,118][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:37:49,440][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:37:49,762][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:37:50,083][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:37:50,404][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:37:50,726][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:37:51,349][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:37:51,670][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:37:51,992][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:37:52,314][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:37:52,635][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:37:52,956][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:37:53,279][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:37:53,601][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:37:53,923][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:37:54,244][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:37:54,565][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:37:54,887][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:37:55,208][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:37:55,874][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:37:56,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:37:56,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:37:56,638][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:37:57,292][__main__][INFO] - Iteration 131 took 27s (11.77% Gen, 85.84% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 33m 51s. Estimated total time: 7h 36m 44s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 22s. [2026-03-25 16:37:57,294][__main__][INFO] - Starting iteration 131. [2026-03-25 16:37:57,297][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:37:57,297][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:38:00,506][__main__][INFO] - Number of regex retries in iteration 131: 0 [2026-03-25 16:38:00,507][__main__][INFO] - agents played in iteration 131 are Alice, Bob [2026-03-25 16:38:01,083][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:38:01,754][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:38:02,046][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:38:02,368][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:38:02,690][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:38:03,012][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:38:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:38:03,657][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:38:03,980][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:38:04,302][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:38:04,623][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:38:04,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:38:05,267][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:38:05,589][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:38:05,912][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:38:06,234][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:38:06,557][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:38:06,878][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:38:07,199][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:38:07,521][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:38:07,842][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:38:08,164][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:38:08,486][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:38:08,809][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:38:09,131][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:38:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:38:09,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:38:10,099][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:38:10,421][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:38:10,743][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:38:11,064][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:38:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:38:11,710][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:38:12,033][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:38:12,356][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:38:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:38:13,001][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:38:13,323][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:38:13,645][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:38:13,967][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:38:14,288][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:38:14,611][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:38:14,935][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:38:15,258][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:38:15,580][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:38:15,901][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:38:16,222][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:38:16,544][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:38:16,865][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:38:17,186][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:38:17,507][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:38:17,828][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:38:18,149][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:38:18,773][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:38:19,094][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:38:19,415][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:38:19,737][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:38:20,059][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:38:20,382][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:38:20,703][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:38:21,025][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:38:21,345][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:38:21,666][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:38:21,988][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:38:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:38:22,631][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:38:23,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:38:24,051][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:38:24,054][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:38:24,055][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:38:24,710][__main__][INFO] - Iteration 132 took 27s (11.71% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 33m 33s. Estimated total time: 7h 36m 53s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 26s. [2026-03-25 16:38:24,712][__main__][INFO] - Starting iteration 132. [2026-03-25 16:38:24,715][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:38:24,716][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:38:27,939][__main__][INFO] - Number of regex retries in iteration 132: 0 [2026-03-25 16:38:27,940][__main__][INFO] - agents played in iteration 132 are Alice, Bob [2026-03-25 16:38:28,522][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:38:29,196][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:38:29,488][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:38:29,810][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:38:30,131][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:38:30,454][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:38:30,776][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:38:31,098][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:38:31,418][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:38:31,740][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:38:32,061][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:38:32,382][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:38:32,703][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:38:33,024][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:38:33,346][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:38:33,667][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:38:33,988][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:38:34,309][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:38:34,631][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:38:34,953][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:38:35,275][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:38:35,596][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:38:35,917][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:38:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:38:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:38:36,883][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:38:37,204][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:38:37,526][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:38:37,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:38:38,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:38:38,491][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:38:38,813][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:38:39,134][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:38:39,455][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:38:39,777][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:38:40,098][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:38:40,419][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:38:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:38:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:38:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:38:41,706][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:38:42,027][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:38:42,349][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:38:42,671][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:38:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:38:43,314][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:38:43,636][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:38:43,957][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:38:44,278][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:38:44,599][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:38:44,920][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:38:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:38:45,562][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:38:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:38:46,506][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:38:46,827][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:38:47,149][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:38:47,471][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:38:47,792][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:38:48,113][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:38:48,435][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:38:48,756][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:38:49,079][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:38:49,400][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:38:49,721][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:38:50,043][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:38:50,709][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:38:51,467][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:38:51,469][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:38:51,471][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:38:52,120][__main__][INFO] - Iteration 133 took 27s (11.77% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 32m 58s. Estimated total time: 7h 36m 46s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 23s. [2026-03-25 16:38:52,123][__main__][INFO] - Starting iteration 133. [2026-03-25 16:38:52,126][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:38:52,126][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:38:55,352][__main__][INFO] - Number of regex retries in iteration 133: 0 [2026-03-25 16:38:55,353][__main__][INFO] - agents played in iteration 133 are Alice, Bob [2026-03-25 16:38:55,942][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:38:56,616][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:38:56,908][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:38:57,230][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:38:57,551][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:38:57,873][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:38:58,194][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:38:58,516][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:38:58,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:38:59,160][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:38:59,481][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:38:59,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:39:00,123][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:39:00,444][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:39:00,765][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:39:01,086][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:39:01,407][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:39:01,727][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:39:02,048][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:39:02,371][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:39:02,692][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:39:03,013][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:39:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:39:03,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:39:03,978][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:39:04,299][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:39:04,622][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:39:04,943][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:39:05,264][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:39:05,587][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:39:05,907][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:39:06,228][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:39:06,551][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:39:06,872][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:39:07,194][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:39:07,516][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:39:07,837][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:39:08,160][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:39:08,482][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:39:08,803][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:39:09,124][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:39:09,445][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:39:09,766][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:39:10,087][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:39:10,410][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:39:10,733][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:39:11,054][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:39:11,376][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:39:11,697][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:39:12,019][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:39:12,340][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:39:12,661][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:39:12,983][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:39:13,606][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:39:13,928][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:39:14,249][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:39:14,571][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:39:14,893][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:39:15,214][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:39:15,536][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:39:15,858][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:39:16,179][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:39:16,500][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:39:16,822][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:39:17,143][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:39:17,463][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:39:18,130][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:39:18,884][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:39:18,886][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:39:18,888][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:39:19,538][__main__][INFO] - Iteration 134 took 27s (11.77% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 32m 38s. Estimated total time: 7h 36m 53s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 26s. [2026-03-25 16:39:19,540][__main__][INFO] - Starting iteration 134. [2026-03-25 16:39:19,543][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:39:19,544][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:39:22,800][__main__][INFO] - Number of regex retries in iteration 134: 0 [2026-03-25 16:39:22,801][__main__][INFO] - agents played in iteration 134 are Alice, Bob [2026-03-25 16:39:23,382][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:39:24,064][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:39:24,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:39:24,679][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:39:25,002][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:39:25,324][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:39:25,646][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:39:25,970][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:39:26,292][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:39:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:39:26,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:39:27,257][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:39:27,579][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:39:27,902][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:39:28,223][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:39:28,546][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:39:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:39:29,190][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:39:29,513][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:39:29,835][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:39:30,156][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:39:30,479][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:39:30,802][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:39:31,124][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:39:31,447][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:39:31,769][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:39:32,093][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:39:32,416][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:39:32,738][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:39:33,061][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:39:33,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:39:33,705][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:39:34,027][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:39:34,350][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:39:34,671][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:39:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:39:35,314][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:39:35,636][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:39:35,959][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:39:36,280][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:39:36,601][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:39:36,922][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:39:37,244][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:39:37,566][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:39:37,888][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:39:38,209][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:39:38,531][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:39:38,852][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:39:39,173][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:39:39,494][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:39:39,816][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:39:40,137][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:39:40,458][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:39:41,080][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:39:41,402][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:39:41,723][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:39:42,044][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:39:42,365][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:39:42,686][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:39:43,008][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:39:43,329][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:39:43,651][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:39:43,974][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:39:44,296][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:39:44,617][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:39:44,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:39:45,605][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:39:46,336][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:39:46,338][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:39:46,340][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:39:46,969][__main__][INFO] - Iteration 135 took 27s (11.88% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 32m 25s. Estimated total time: 7h 37m 7s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 42s, 500 more iterations: 3h 48m 33s. [2026-03-25 16:39:46,972][__main__][INFO] - Starting iteration 135. [2026-03-25 16:39:46,975][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:39:46,976][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:39:50,226][__main__][INFO] - Number of regex retries in iteration 135: 0 [2026-03-25 16:39:50,227][__main__][INFO] - agents played in iteration 135 are Alice, Bob [2026-03-25 16:39:50,809][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:39:51,482][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:39:51,775][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:39:52,098][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:39:52,420][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:39:52,741][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:39:53,065][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:39:53,386][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:39:53,708][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:39:54,028][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:39:54,350][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:39:54,671][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:39:54,991][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:39:55,312][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:39:55,634][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:39:55,955][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:39:56,276][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:39:56,599][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:39:56,921][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:39:57,243][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:39:57,564][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:39:57,884][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:39:58,208][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:39:58,530][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:39:58,851][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:39:59,173][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:39:59,494][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:39:59,815][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:40:00,136][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:40:00,457][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:40:00,778][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:40:01,100][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:40:01,421][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:40:01,742][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:40:02,064][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:40:02,385][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:40:02,708][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:40:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:40:03,353][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:40:03,675][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:40:03,996][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:40:04,316][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:40:04,637][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:40:04,958][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:40:05,279][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:40:05,600][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:40:05,921][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:40:06,242][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:40:06,564][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:40:06,885][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:40:07,206][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:40:07,527][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:40:07,849][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:40:08,474][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:40:08,796][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:40:09,117][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:40:09,439][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:40:09,760][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:40:10,080][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:40:10,402][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:40:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:40:11,044][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:40:11,365][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:40:11,686][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:40:12,010][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:40:12,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:40:13,001][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:40:13,754][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:40:13,756][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:40:13,758][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:40:14,416][__main__][INFO] - Iteration 136 took 27s (11.85% Gen, 85.75% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 32m 11s. Estimated total time: 7h 37m 21s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 40s. [2026-03-25 16:40:14,418][__main__][INFO] - Starting iteration 136. [2026-03-25 16:40:14,422][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:40:14,422][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:40:17,705][__main__][INFO] - Number of regex retries in iteration 136: 0 [2026-03-25 16:40:17,706][__main__][INFO] - agents played in iteration 136 are Alice, Bob [2026-03-25 16:40:18,306][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:40:18,979][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:40:19,272][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:40:19,593][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:40:19,914][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:40:20,235][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:40:20,559][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:40:20,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:40:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:40:21,524][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:40:21,845][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:40:22,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:40:22,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:40:22,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:40:23,131][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:40:23,453][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:40:23,773][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:40:24,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:40:24,416][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:40:24,739][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:40:25,060][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:40:25,382][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:40:25,704][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:40:26,027][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:40:26,348][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:40:26,671][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:40:26,992][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:40:27,314][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:40:27,636][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:40:27,956][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:40:28,278][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:40:28,600][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:40:28,922][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:40:29,243][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:40:29,569][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:40:29,892][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:40:30,213][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:40:30,535][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:40:30,857][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:40:31,178][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:40:31,500][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:40:31,821][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:40:32,142][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:40:32,464][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:40:32,785][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:40:33,106][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:40:33,427][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:40:33,748][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:40:34,070][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:40:34,394][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:40:34,716][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:40:35,037][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:40:35,358][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:40:35,987][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:40:36,309][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:40:36,632][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:40:36,954][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:40:37,275][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:40:37,597][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:40:37,919][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:40:38,240][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:40:38,562][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:40:38,888][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:40:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:40:39,533][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:40:39,855][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:40:40,523][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:40:41,272][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:40:41,274][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:40:41,276][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:40:41,924][__main__][INFO] - Iteration 137 took 27s (11.94% Gen, 85.70% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 32m 46s. Estimated total time: 7h 38m 23s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 50s, 500 more iterations: 3h 49m 11s. [2026-03-25 16:40:41,927][__main__][INFO] - Starting iteration 137. [2026-03-25 16:40:41,930][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:40:41,930][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:40:45,223][__main__][INFO] - Number of regex retries in iteration 137: 0 [2026-03-25 16:40:45,224][__main__][INFO] - agents played in iteration 137 are Alice, Bob [2026-03-25 16:40:45,807][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:40:46,498][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:40:46,791][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:40:47,114][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:40:47,436][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:40:47,758][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:40:48,081][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:40:48,431][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:40:48,753][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:40:49,076][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:40:49,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:40:49,721][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:40:50,043][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:40:50,367][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:40:50,690][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:40:51,013][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:40:51,336][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:40:51,659][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:40:51,981][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:40:52,303][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:40:52,626][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:40:52,947][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:40:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:40:53,594][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:40:53,915][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:40:54,236][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:40:54,557][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:40:54,878][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:40:55,200][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:40:55,521][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:40:55,843][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:40:56,164][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:40:56,487][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:40:56,807][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:40:57,129][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:40:57,450][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:40:57,771][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:40:58,093][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:40:58,416][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:40:58,740][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:40:59,062][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:40:59,383][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:40:59,705][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:41:00,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:41:00,347][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:41:00,668][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:41:00,989][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:41:01,309][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:41:01,631][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:41:01,953][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:41:02,275][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:41:02,596][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:41:02,918][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:41:03,542][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:41:03,864][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:41:04,186][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:41:04,507][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:41:04,828][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:41:05,149][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:41:05,472][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:41:05,793][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:41:06,115][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:41:06,437][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:41:06,758][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:41:07,080][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:41:07,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:41:08,068][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:41:08,830][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:41:08,833][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:41:08,834][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:41:09,494][__main__][INFO] - Iteration 138 took 27s (11.95% Gen, 85.65% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 33m 20s. Estimated total time: 7h 39m 25s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 56s, 500 more iterations: 3h 49m 42s. [2026-03-25 16:41:09,496][__main__][INFO] - Starting iteration 138. [2026-03-25 16:41:09,500][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:41:09,500][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:41:12,733][__main__][INFO] - Number of regex retries in iteration 138: 0 [2026-03-25 16:41:12,734][__main__][INFO] - agents played in iteration 138 are Alice, Bob [2026-03-25 16:41:13,327][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:41:14,025][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:41:14,317][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:41:14,638][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:41:14,959][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:41:15,280][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:41:15,602][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:41:15,923][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:41:16,246][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:41:16,568][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:41:16,890][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:41:17,212][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:41:17,534][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:41:17,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:41:18,176][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:41:18,498][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:41:18,819][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:41:19,142][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:41:19,465][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:41:19,786][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:41:20,108][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:41:20,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:41:20,752][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:41:21,075][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:41:21,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:41:21,717][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:41:22,039][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:41:22,360][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:41:22,681][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:41:23,003][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:41:23,325][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:41:23,647][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:41:23,970][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:41:24,294][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:41:24,617][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:41:24,938][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:41:25,261][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:41:25,583][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:41:25,905][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:41:26,226][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:41:26,547][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:41:26,868][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:41:27,190][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:41:27,512][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:41:27,834][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:41:28,156][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:41:28,478][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:41:28,799][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:41:29,120][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:41:29,442][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:41:29,763][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:41:30,084][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:41:30,406][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:41:31,036][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:41:31,358][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:41:31,679][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:41:32,000][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:41:32,322][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:41:32,643][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:41:32,965][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:41:33,287][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:41:33,609][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:41:33,931][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:41:34,253][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:41:34,575][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:41:34,897][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:41:35,575][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:41:36,321][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:41:36,323][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:41:36,325][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:41:36,982][__main__][INFO] - Iteration 139 took 27s (11.77% Gen, 85.84% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 31m 30s. Estimated total time: 7h 38m 3s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 48s, 500 more iterations: 3h 49m 1s. [2026-03-25 16:41:36,984][__main__][INFO] - Starting iteration 139. [2026-03-25 16:41:36,987][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:41:36,988][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:41:40,237][__main__][INFO] - Number of regex retries in iteration 139: 0 [2026-03-25 16:41:40,238][__main__][INFO] - agents played in iteration 139 are Alice, Bob [2026-03-25 16:41:40,835][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:41:41,520][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:41:41,812][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:41:42,134][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:41:42,456][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:41:42,777][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:41:43,099][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:41:43,420][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:41:43,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:41:44,063][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:41:44,385][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:41:44,707][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:41:45,028][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:41:45,350][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:41:45,671][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:41:45,994][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:41:46,317][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:41:46,640][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:41:46,961][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:41:47,283][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:41:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:41:47,927][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:41:48,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:41:48,571][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:41:48,894][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:41:49,216][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:41:49,537][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:41:49,859][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:41:50,181][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:41:50,503][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:41:50,825][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:41:51,147][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:41:51,469][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:41:51,791][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:41:52,113][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:41:52,434][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:41:52,756][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:41:53,077][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:41:53,399][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:41:53,721][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:41:54,043][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:41:54,365][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:41:54,688][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:41:55,010][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:41:55,331][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:41:55,653][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:41:55,974][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:41:56,296][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:41:56,618][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:41:56,940][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:41:57,263][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:41:57,585][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:41:57,906][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:41:58,540][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:41:58,862][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:41:59,185][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:41:59,509][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:41:59,831][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:42:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:42:00,476][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:42:00,798][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:42:01,119][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:42:01,442][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:42:01,764][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:42:02,087][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:42:02,410][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:42:03,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:42:03,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:42:03,848][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:42:03,849][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:42:04,507][__main__][INFO] - Iteration 140 took 27s (11.81% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 31m 40s. Estimated total time: 7h 38m 40s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 52s, 500 more iterations: 3h 49m 20s. [2026-03-25 16:42:04,509][__main__][INFO] - Starting iteration 140. [2026-03-25 16:42:04,512][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:42:04,513][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:42:07,809][__main__][INFO] - Number of regex retries in iteration 140: 0 [2026-03-25 16:42:07,810][__main__][INFO] - agents played in iteration 140 are Alice, Bob [2026-03-25 16:42:08,442][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:42:09,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:42:09,426][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:42:09,748][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:42:10,070][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:42:10,394][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:42:10,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:42:11,039][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:42:11,360][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:42:11,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:42:12,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:42:12,326][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:42:12,647][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:42:12,969][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:42:13,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:42:13,611][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:42:13,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:42:14,255][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:42:14,577][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:42:14,899][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:42:15,220][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:42:15,542][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:42:15,863][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:42:16,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:42:16,508][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:42:16,828][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:42:17,150][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:42:17,472][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:42:17,793][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:42:18,115][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:42:18,436][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:42:18,757][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:42:19,079][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:42:19,401][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:42:19,722][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:42:20,044][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:42:20,365][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:42:20,686][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:42:21,007][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:42:21,329][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:42:21,651][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:42:21,972][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:42:22,295][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:42:22,616][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:42:22,937][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:42:23,258][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:42:23,580][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:42:23,902][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:42:24,223][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:42:24,545][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:42:24,866][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:42:25,188][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:42:25,511][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:42:26,140][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:42:26,462][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:42:26,783][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:42:27,105][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:42:27,427][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:42:27,748][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:42:28,070][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:42:28,392][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:42:28,714][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:42:29,035][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:42:29,357][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:42:29,678][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:42:30,000][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:42:30,678][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:42:31,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:42:31,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:42:31,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:42:32,084][__main__][INFO] - Iteration 141 took 27s (11.96% Gen, 85.66% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 32m 5s. Estimated total time: 7h 39m 33s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 57s, 500 more iterations: 3h 49m 46s. [2026-03-25 16:42:32,087][__main__][INFO] - Starting iteration 141. [2026-03-25 16:42:32,090][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:42:32,090][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:42:35,350][__main__][INFO] - Number of regex retries in iteration 141: 0 [2026-03-25 16:42:35,351][__main__][INFO] - agents played in iteration 141 are Alice, Bob [2026-03-25 16:42:35,938][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:42:36,622][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:42:36,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:42:37,235][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:42:37,556][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:42:37,877][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:42:38,198][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:42:38,519][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:42:38,841][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:42:39,163][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:42:39,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:42:39,806][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:42:40,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:42:40,448][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:42:40,770][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:42:41,093][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:42:41,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:42:41,736][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:42:42,058][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:42:42,380][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:42:42,702][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:42:43,025][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:42:43,346][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:42:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:42:43,990][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:42:44,312][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:42:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:42:44,956][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:42:45,277][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:42:45,598][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:42:45,920][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:42:46,242][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:42:46,562][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:42:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:42:47,206][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:42:47,528][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:42:47,849][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:42:48,173][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:42:48,494][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:42:48,816][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:42:49,138][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:42:49,460][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:42:49,781][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:42:50,102][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:42:50,425][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:42:50,746][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:42:51,068][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:42:51,389][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:42:51,712][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:42:52,033][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:42:52,355][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:42:52,676][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:42:52,997][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:42:53,621][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:42:53,942][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:42:54,263][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:42:54,585][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:42:54,907][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:42:55,229][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:42:55,551][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:42:55,873][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:42:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:42:56,515][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:42:56,836][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:42:57,158][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:42:57,479][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:42:58,146][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:42:58,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:42:58,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:42:58,889][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:42:59,539][__main__][INFO] - Iteration 142 took 27s (11.88% Gen, 85.75% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 29m 35s. Estimated total time: 7h 37m 30s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 45s, 500 more iterations: 3h 48m 45s. [2026-03-25 16:42:59,541][__main__][INFO] - Starting iteration 142. [2026-03-25 16:42:59,545][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:42:59,545][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:43:02,791][__main__][INFO] - Number of regex retries in iteration 142: 0 [2026-03-25 16:43:02,792][__main__][INFO] - agents played in iteration 142 are Alice, Bob [2026-03-25 16:43:03,388][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:43:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:43:04,402][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:43:04,724][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:43:05,045][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:43:05,366][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:43:05,688][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:43:06,010][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:43:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:43:06,654][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:43:06,976][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:43:07,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:43:07,623][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:43:07,944][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:43:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:43:08,588][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:43:08,910][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:43:09,232][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:43:09,554][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:43:09,876][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:43:10,197][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:43:10,519][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:43:10,842][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:43:11,164][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:43:11,486][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:43:11,808][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:43:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:43:12,450][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:43:12,773][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:43:13,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:43:13,416][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:43:13,738][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:43:14,060][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:43:14,382][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:43:14,705][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:43:15,028][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:43:15,349][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:43:15,672][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:43:15,994][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:43:16,316][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:43:16,639][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:43:16,962][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:43:17,283][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:43:17,606][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:43:17,929][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:43:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:43:18,573][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:43:18,895][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:43:19,218][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:43:19,540][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:43:19,861][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:43:20,184][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:43:20,506][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:43:21,145][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:43:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:43:21,789][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:43:22,112][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:43:22,434][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:43:22,757][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:43:23,079][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:43:23,401][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:43:23,724][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:43:24,047][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:43:24,369][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:43:24,692][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:43:25,015][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:43:25,691][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:43:26,433][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:43:26,435][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:43:26,437][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:43:27,102][__main__][INFO] - Iteration 143 took 27s (11.78% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 30m 55s. Estimated total time: 7h 39m 18s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 55s, 500 more iterations: 3h 49m 39s. [2026-03-25 16:43:27,105][__main__][INFO] - Starting iteration 143. [2026-03-25 16:43:27,108][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:43:27,108][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:43:30,436][__main__][INFO] - Number of regex retries in iteration 143: 0 [2026-03-25 16:43:30,437][__main__][INFO] - agents played in iteration 143 are Alice, Bob [2026-03-25 16:43:31,086][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:43:31,760][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:43:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:43:32,376][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:43:32,699][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:43:33,021][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:43:33,343][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:43:33,664][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:43:33,985][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:43:34,308][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:43:34,630][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:43:34,951][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:43:35,273][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:43:35,595][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:43:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:43:36,239][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:43:36,561][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:43:36,884][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:43:37,206][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:43:37,527][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:43:37,849][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:43:38,171][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:43:38,493][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:43:38,814][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:43:39,135][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:43:39,456][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:43:39,777][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:43:40,099][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:43:40,420][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:43:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:43:41,063][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:43:41,385][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:43:41,707][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:43:42,028][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:43:42,350][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:43:42,671][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:43:42,993][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:43:43,314][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:43:43,635][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:43:43,956][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:43:44,277][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:43:44,599][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:43:44,920][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:43:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:43:45,563][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:43:45,885][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:43:46,206][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:43:46,528][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:43:46,850][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:43:47,175][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:43:47,497][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:43:47,819][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:43:48,140][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:43:48,789][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:43:49,112][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:43:49,434][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:43:49,756][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:43:50,077][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:43:50,399][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:43:50,721][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:43:51,042][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:43:51,364][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:43:51,686][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:43:52,008][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:43:52,330][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:43:52,652][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:43:53,317][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:43:54,062][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:43:54,064][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:43:54,066][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:43:54,731][__main__][INFO] - Iteration 144 took 27s (12.05% Gen, 85.53% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 31m 34s. Estimated total time: 7h 40m 24s. Time estimates for 10 more iterations: 4m 36s, 100 more iterations: 46m 2s, 500 more iterations: 3h 50m 12s. [2026-03-25 16:43:54,733][__main__][INFO] - Starting iteration 144. [2026-03-25 16:43:54,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:43:54,737][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:43:58,013][__main__][INFO] - Number of regex retries in iteration 144: 0 [2026-03-25 16:43:58,014][__main__][INFO] - agents played in iteration 144 are Alice, Bob [2026-03-25 16:43:58,612][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:43:59,285][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:43:59,577][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:43:59,899][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:44:00,221][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:44:00,543][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:44:00,864][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:44:01,186][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:44:01,507][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:44:01,828][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:44:02,150][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:44:02,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:44:02,793][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:44:03,115][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:44:03,436][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:44:03,758][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:44:04,079][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:44:04,401][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:44:04,723][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:44:05,048][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:44:05,370][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:44:05,693][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:44:06,016][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:44:06,338][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:44:06,660][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:44:06,982][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:44:07,304][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:44:07,627][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:44:07,948][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:44:08,270][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:44:08,591][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:44:08,913][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:44:09,235][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:44:09,557][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:44:09,879][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:44:10,201][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:44:10,523][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:44:10,845][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:44:11,166][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:44:11,488][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:44:11,811][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:44:12,134][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:44:12,456][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:44:12,779][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:44:13,101][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:44:13,422][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:44:13,743][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:44:14,065][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:44:14,388][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:44:14,710][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:44:15,032][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:44:15,356][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:44:15,678][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:44:16,320][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:44:16,641][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:44:16,962][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:44:17,284][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:44:17,606][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:44:17,928][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:44:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:44:18,572][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:44:18,893][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:44:19,216][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:44:19,537][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:44:19,858][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:44:20,179][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:44:20,847][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:44:21,600][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:44:21,602][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:44:21,604][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:44:22,299][__main__][INFO] - Iteration 145 took 27s (11.89% Gen, 85.58% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 30m 5s. Estimated total time: 7h 39m 23s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 56s, 500 more iterations: 3h 49m 41s. [2026-03-25 16:44:22,302][__main__][INFO] - Starting iteration 145. [2026-03-25 16:44:22,305][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:44:22,306][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:44:25,593][__main__][INFO] - Number of regex retries in iteration 145: 0 [2026-03-25 16:44:25,594][__main__][INFO] - agents played in iteration 145 are Alice, Bob [2026-03-25 16:44:26,195][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:44:26,870][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:44:27,163][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:44:27,486][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:44:27,807][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:44:28,129][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:44:28,450][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:44:28,772][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:44:29,094][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:44:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:44:29,738][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:44:30,060][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:44:30,382][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:44:30,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:44:31,025][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:44:31,346][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:44:31,668][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:44:31,989][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:44:32,311][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:44:32,633][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:44:32,954][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:44:33,276][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:44:33,599][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:44:33,920][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:44:34,242][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:44:34,564][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:44:34,886][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:44:35,208][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:44:35,530][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:44:35,853][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:44:36,175][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:44:36,498][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:44:36,820][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:44:37,142][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:44:37,465][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:44:37,787][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:44:38,109][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:44:38,432][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:44:38,755][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:44:39,077][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:44:39,400][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:44:39,722][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:44:40,045][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:44:40,367][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:44:40,690][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:44:41,012][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:44:41,335][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:44:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:44:41,979][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:44:42,301][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:44:42,624][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:44:42,946][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:44:43,268][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:44:43,891][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:44:44,214][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:44:44,537][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:44:44,860][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:44:45,183][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:44:45,506][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:44:45,828][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:44:46,150][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:44:46,471][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:44:46,794][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:44:47,116][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:44:47,438][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:44:47,759][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:44:48,442][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:44:49,189][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:44:49,191][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:44:49,193][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:44:49,856][__main__][INFO] - Iteration 146 took 27s (11.93% Gen, 85.65% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 29m 26s. Estimated total time: 7h 39m 12s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 55s, 500 more iterations: 3h 49m 36s. [2026-03-25 16:44:49,859][__main__][INFO] - Starting iteration 146. [2026-03-25 16:44:49,862][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:44:49,862][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:44:53,099][__main__][INFO] - Number of regex retries in iteration 146: 0 [2026-03-25 16:44:53,100][__main__][INFO] - agents played in iteration 146 are Alice, Bob [2026-03-25 16:44:53,683][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:44:54,355][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:44:54,647][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:44:54,968][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:44:55,292][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:44:55,613][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:44:55,935][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:44:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:44:56,578][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:44:56,901][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:44:57,223][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:44:57,545][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:44:57,867][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:44:58,189][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:44:58,511][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:44:58,833][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:44:59,155][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:44:59,478][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:44:59,800][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:45:00,121][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:45:00,442][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:45:00,763][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:45:01,085][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:45:01,407][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:45:01,728][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:45:02,050][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:45:02,372][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:45:02,694][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:45:03,017][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:45:03,338][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:45:03,659][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:45:03,981][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:45:04,304][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:45:04,625][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:45:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:45:05,268][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:45:05,590][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:45:05,912][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:45:06,233][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:45:06,555][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:45:06,877][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:45:07,198][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:45:07,519][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:45:07,841][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:45:08,162][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:45:08,483][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:45:08,805][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:45:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:45:09,448][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:45:09,769][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:45:10,091][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:45:10,413][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:45:10,734][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:45:11,358][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:45:11,680][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:45:12,003][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:45:12,324][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:45:12,647][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:45:12,968][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:45:13,291][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:45:13,612][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:45:13,934][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:45:14,256][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:45:14,577][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:45:14,900][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:45:15,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:45:15,888][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:45:16,627][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:45:16,630][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:45:16,631][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:45:17,295][__main__][INFO] - Iteration 147 took 27s (11.80% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 27m 1s. Estimated total time: 7h 37m 14s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 43s, 500 more iterations: 3h 48m 37s. [2026-03-25 16:45:17,297][__main__][INFO] - Starting iteration 147. [2026-03-25 16:45:17,300][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:45:17,301][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:45:20,571][__main__][INFO] - Number of regex retries in iteration 147: 0 [2026-03-25 16:45:20,571][__main__][INFO] - agents played in iteration 147 are Alice, Bob [2026-03-25 16:45:21,201][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:45:21,874][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:45:22,167][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:45:22,489][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:45:22,811][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:45:23,132][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:45:23,455][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:45:23,776][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:45:24,098][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:45:24,419][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:45:24,740][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:45:25,061][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:45:25,383][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:45:25,705][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:45:26,027][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:45:26,349][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:45:26,672][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:45:26,994][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:45:27,317][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:45:27,639][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:45:27,961][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:45:28,282][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:45:28,602][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:45:28,924][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:45:29,246][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:45:29,568][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:45:29,890][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:45:30,212][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:45:30,533][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:45:30,856][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:45:31,178][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:45:31,500][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:45:31,821][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:45:32,143][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:45:32,465][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:45:32,786][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:45:33,108][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:45:33,429][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:45:33,751][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:45:34,073][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:45:34,395][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:45:34,716][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:45:35,038][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:45:35,361][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:45:35,683][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:45:36,004][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:45:36,325][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:45:36,647][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:45:36,968][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:45:37,290][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:45:37,611][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:45:37,933][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:45:38,255][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:45:38,878][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:45:39,200][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:45:39,522][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:45:39,844][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:45:40,166][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:45:40,487][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:45:40,808][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:45:41,131][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:45:41,452][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:45:41,775][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:45:42,097][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:45:42,419][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:45:42,740][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:45:43,408][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:45:44,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:45:44,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:45:44,156][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:45:44,822][__main__][INFO] - Iteration 148 took 27s (11.88% Gen, 85.69% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 28m 2s. Estimated total time: 7h 38m 43s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 52s, 500 more iterations: 3h 49m 21s. [2026-03-25 16:45:44,825][__main__][INFO] - Starting iteration 148. [2026-03-25 16:45:44,828][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:45:44,829][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:45:48,096][__main__][INFO] - Number of regex retries in iteration 148: 0 [2026-03-25 16:45:48,097][__main__][INFO] - agents played in iteration 148 are Alice, Bob [2026-03-25 16:45:48,712][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:45:49,383][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:45:49,677][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:45:49,999][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:45:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:45:50,643][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:45:50,964][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:45:51,286][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:45:51,609][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:45:51,932][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:45:52,254][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:45:52,576][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:45:52,899][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:45:53,221][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:45:53,542][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:45:53,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:45:54,187][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:45:54,508][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:45:54,830][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:45:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:45:55,475][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:45:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:45:56,120][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:45:56,441][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:45:56,764][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:45:57,086][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:45:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:45:57,731][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:45:58,055][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:45:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:45:58,700][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:45:59,022][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:45:59,345][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:45:59,668][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:45:59,991][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:46:00,313][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:46:00,636][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:46:00,957][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:46:01,280][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:46:01,601][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:46:01,923][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:46:02,247][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:46:02,568][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:46:02,890][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:46:03,212][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:46:03,534][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:46:03,856][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:46:04,177][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:46:04,498][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:46:04,819][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:46:05,141][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:46:05,462][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:46:05,783][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:46:06,406][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:46:06,727][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:46:07,049][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:46:07,372][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:46:07,693][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:46:08,014][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:46:08,335][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:46:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:46:08,978][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:46:09,300][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:46:09,620][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:46:09,942][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:46:10,263][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:46:10,929][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:46:11,671][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:46:11,673][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:46:11,675][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:46:12,337][__main__][INFO] - Iteration 149 took 27s (11.88% Gen, 85.71% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 27m 22s. Estimated total time: 7h 38m 30s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 51s, 500 more iterations: 3h 49m 15s. [2026-03-25 16:46:12,340][__main__][INFO] - Starting iteration 149. [2026-03-25 16:46:12,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:46:12,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:46:15,588][__main__][INFO] - Number of regex retries in iteration 149: 0 [2026-03-25 16:46:15,589][__main__][INFO] - agents played in iteration 149 are Alice, Bob [2026-03-25 16:46:16,178][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:46:16,850][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:46:17,142][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:46:17,464][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:46:17,786][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:46:18,106][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:46:18,428][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:46:18,749][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:46:19,071][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:46:19,393][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:46:19,715][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:46:20,037][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:46:20,358][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:46:20,680][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:46:21,001][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:46:21,323][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:46:21,644][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:46:21,965][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:46:22,286][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:46:22,609][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:46:22,930][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:46:23,253][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:46:23,574][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:46:23,896][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:46:24,217][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:46:24,539][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:46:24,860][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:46:25,181][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:46:25,503][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:46:25,824][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:46:26,146][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:46:26,467][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:46:26,788][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:46:27,110][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:46:27,431][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:46:27,752][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:46:28,074][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:46:28,395][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:46:28,716][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:46:29,038][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:46:29,361][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:46:29,682][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:46:30,002][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:46:30,324][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:46:30,646][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:46:30,967][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:46:31,288][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:46:31,610][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:46:31,932][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:46:32,253][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:46:32,574][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:46:32,896][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:46:33,217][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:46:33,848][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:46:34,169][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:46:34,491][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:46:34,814][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:46:35,135][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:46:35,457][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:46:35,779][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:46:36,101][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:46:36,422][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:46:36,742][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:46:37,063][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:46:37,385][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:46:37,707][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:46:38,373][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:46:39,122][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:46:39,124][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:46:39,126][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:46:39,777][__main__][INFO] - Iteration 150 took 27s (11.83% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 25m 40s. Estimated total time: 7h 37m 15s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 43s, 500 more iterations: 3h 48m 37s. [2026-03-25 16:46:39,779][__main__][INFO] - Starting iteration 150. [2026-03-25 16:46:39,782][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 2 and human policies 1. [2026-03-25 16:46:39,783][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:46:42,956][__main__][INFO] - Number of regex retries in iteration 150: 0 [2026-03-25 16:46:42,957][__main__][INFO] - agents played in iteration 150 are Alice, Bob [2026-03-25 16:46:43,516][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:46:44,188][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:46:44,481][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:46:44,803][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:46:45,124][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:46:45,445][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:46:45,765][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:46:46,087][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:46:46,408][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:46:46,729][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:46:47,050][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:46:47,374][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:46:47,696][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:46:48,019][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:46:48,341][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:46:48,662][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:46:48,984][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:46:49,305][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:46:49,627][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:46:49,949][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:46:50,271][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:46:50,594][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:46:50,915][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:46:51,236][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:46:51,558][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:46:51,880][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:46:52,203][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:46:52,525][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:46:52,847][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:46:53,169][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:46:53,489][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:46:53,812][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:46:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:46:54,455][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:46:54,776][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:46:55,097][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:46:55,418][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:46:55,741][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:46:56,062][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:46:56,383][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:46:56,705][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:46:57,026][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:46:57,348][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:46:57,670][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:46:57,993][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:46:58,315][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:46:58,636][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:46:58,957][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:46:59,279][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:46:59,601][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:46:59,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:47:00,244][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:47:00,564][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:47:01,187][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:47:01,509][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:47:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:47:02,153][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:47:02,474][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:47:02,795][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:47:03,117][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:47:03,439][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:47:03,760][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:47:04,082][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:47:04,402][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:47:04,724][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:47:05,045][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:47:05,722][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:47:06,470][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:47:06,472][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:47:06,474][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:47:07,700][__main__][INFO] - Iteration 151 took 27s (11.37% Gen, 84.23% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 33m 15s. Estimated total time: 7h 45m 19s. Time estimates for 10 more iterations: 4m 39s, 100 more iterations: 46m 31s, 500 more iterations: 3h 52m 39s. [2026-03-25 16:47:07,703][__main__][INFO] - Starting iteration 151. [2026-03-25 16:47:07,707][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:47:07,707][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:47:10,949][__main__][INFO] - Number of regex retries in iteration 151: 0 [2026-03-25 16:47:10,950][__main__][INFO] - agents played in iteration 151 are Alice, Bob [2026-03-25 16:47:11,506][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:47:12,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:47:12,480][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:47:12,802][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:47:13,125][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:47:13,447][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:47:13,768][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:47:14,091][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:47:14,414][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:47:14,737][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:47:15,060][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:47:15,382][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:47:15,705][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:47:16,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:47:16,349][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:47:16,672][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:47:16,994][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:47:17,315][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:47:17,638][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:47:17,961][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:47:18,284][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:47:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:47:18,927][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:47:19,248][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:47:19,569][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:47:19,892][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:47:20,214][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:47:20,535][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:47:20,856][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:47:21,178][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:47:21,500][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:47:21,820][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:47:22,142][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:47:22,464][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:47:22,785][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:47:23,107][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:47:23,429][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:47:23,750][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:47:24,073][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:47:24,395][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:47:24,716][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:47:25,038][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:47:25,360][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:47:25,681][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:47:26,002][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:47:26,323][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:47:26,644][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:47:26,965][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:47:27,287][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:47:27,609][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:47:27,930][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:47:28,252][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:47:28,575][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:47:29,196][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:47:29,518][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:47:29,839][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:47:30,161][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:47:30,482][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:47:30,804][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:47:31,125][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:47:31,446][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:47:31,768][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:47:32,090][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:47:32,411][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:47:32,733][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:47:33,054][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:47:33,719][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:47:34,462][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:47:34,464][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:47:34,466][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:47:35,121][__main__][INFO] - Iteration 152 took 27s (11.83% Gen, 85.78% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 24m 24s. Estimated total time: 7h 36m 55s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 27s. [2026-03-25 16:47:35,123][__main__][INFO] - Starting iteration 152. [2026-03-25 16:47:35,126][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:47:35,127][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:47:38,312][__main__][INFO] - Number of regex retries in iteration 152: 0 [2026-03-25 16:47:38,313][__main__][INFO] - agents played in iteration 152 are Alice, Bob [2026-03-25 16:47:38,851][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:47:39,522][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:47:39,814][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:47:40,135][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:47:40,457][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:47:40,778][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:47:41,099][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:47:41,420][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:47:41,742][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:47:42,063][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:47:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:47:42,708][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:47:43,029][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:47:43,351][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:47:43,672][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:47:43,994][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:47:44,316][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:47:44,636][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:47:44,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:47:45,278][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:47:45,600][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:47:45,922][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:47:46,243][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:47:46,564][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:47:46,885][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:47:47,205][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:47:47,526][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:47:47,847][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:47:48,169][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:47:48,490][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:47:48,812][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:47:49,134][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:47:49,455][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:47:49,777][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:47:50,098][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:47:50,420][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:47:50,741][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:47:51,063][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:47:51,384][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:47:51,706][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:47:52,027][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:47:52,349][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:47:52,672][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:47:52,993][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:47:53,315][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:47:53,636][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:47:53,957][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:47:54,277][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:47:54,599][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:47:54,920][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:47:55,242][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:47:55,563][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:47:55,885][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:47:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:47:56,827][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:47:57,149][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:47:57,471][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:47:57,794][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:47:58,115][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:47:58,436][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:47:58,757][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:47:59,078][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:47:59,400][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:47:59,721][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:48:00,042][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:48:00,364][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:48:01,030][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:48:01,780][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:48:01,782][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:48:01,783][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:48:02,438][__main__][INFO] - Iteration 153 took 27s (11.67% Gen, 85.93% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 22m 14s. Estimated total time: 7h 35m 12s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 36s. [2026-03-25 16:48:02,440][__main__][INFO] - Starting iteration 153. [2026-03-25 16:48:02,443][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:48:02,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:48:05,637][__main__][INFO] - Number of regex retries in iteration 153: 0 [2026-03-25 16:48:05,638][__main__][INFO] - agents played in iteration 153 are Alice, Bob [2026-03-25 16:48:06,181][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:48:06,866][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:48:07,158][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:48:07,481][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:48:07,802][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:48:08,123][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:48:08,444][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:48:08,766][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:48:09,088][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:48:09,410][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:48:09,732][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:48:10,053][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:48:10,375][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:48:10,697][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:48:11,018][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:48:11,339][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:48:11,660][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:48:11,981][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:48:12,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:48:12,624][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:48:12,946][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:48:13,267][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:48:13,588][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:48:13,910][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:48:14,233][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:48:14,555][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:48:14,876][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:48:15,197][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:48:15,518][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:48:15,840][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:48:16,161][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:48:16,483][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:48:16,805][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:48:17,126][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:48:17,447][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:48:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:48:18,091][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:48:18,413][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:48:18,735][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:48:19,057][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:48:19,378][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:48:19,700][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:48:20,021][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:48:20,343][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:48:20,664][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:48:20,985][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:48:21,306][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:48:21,627][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:48:21,948][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:48:22,270][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:48:22,592][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:48:22,915][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:48:23,237][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:48:23,872][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:48:24,194][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:48:24,516][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:48:24,837][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:48:25,159][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:48:25,481][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:48:25,803][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:48:26,125][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:48:26,448][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:48:26,769][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:48:27,092][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:48:27,414][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:48:27,737][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:48:28,492][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:48:29,262][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:48:29,268][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:48:29,270][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:48:29,949][__main__][INFO] - Iteration 154 took 27s (11.61% Gen, 85.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 25m 1s. Estimated total time: 7h 38m 26s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 50s, 500 more iterations: 3h 49m 13s. [2026-03-25 16:48:29,953][__main__][INFO] - Starting iteration 154. [2026-03-25 16:48:29,958][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:48:29,959][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:48:33,337][__main__][INFO] - Number of regex retries in iteration 154: 0 [2026-03-25 16:48:33,338][__main__][INFO] - agents played in iteration 154 are Alice, Bob [2026-03-25 16:48:33,882][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:48:34,572][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:48:34,869][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:48:35,193][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:48:35,516][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:48:35,839][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:48:36,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:48:36,483][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:48:36,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:48:37,126][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:48:37,448][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:48:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:48:38,090][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:48:38,413][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:48:38,736][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:48:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:48:39,378][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:48:39,700][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:48:40,021][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:48:40,342][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:48:40,664][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:48:40,985][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:48:41,307][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:48:41,628][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:48:41,949][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:48:42,271][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:48:42,592][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:48:42,914][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:48:43,235][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:48:43,557][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:48:43,879][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:48:44,200][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:48:44,522][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:48:44,844][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:48:45,165][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:48:45,487][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:48:45,808][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:48:46,130][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:48:46,452][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:48:46,773][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:48:47,095][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:48:47,417][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:48:47,738][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:48:48,059][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:48:48,380][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:48:48,702][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:48:49,023][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:48:49,344][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:48:49,665][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:48:49,987][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:48:50,309][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:48:50,630][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:48:50,952][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:48:51,572][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:48:51,894][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:48:52,215][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:48:52,537][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:48:52,858][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:48:53,181][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:48:53,502][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:48:53,824][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:48:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:48:54,467][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:48:54,788][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:48:55,110][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:48:55,432][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:48:56,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:48:56,850][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:48:56,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:48:56,854][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:48:57,511][__main__][INFO] - Iteration 155 took 27s (12.26% Gen, 85.34% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 25m 21s. Estimated total time: 7h 39m 14s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 55s, 500 more iterations: 3h 49m 37s. [2026-03-25 16:48:57,514][__main__][INFO] - Starting iteration 155. [2026-03-25 16:48:57,517][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:48:57,517][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:49:00,716][__main__][INFO] - Number of regex retries in iteration 155: 0 [2026-03-25 16:49:00,717][__main__][INFO] - agents played in iteration 155 are Alice, Bob [2026-03-25 16:49:01,254][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:49:01,927][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:49:02,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:49:02,542][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:49:02,863][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:49:03,184][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:49:03,505][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:49:03,826][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:49:04,147][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:49:04,468][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:49:04,790][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:49:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:49:05,433][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:49:05,755][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:49:06,076][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:49:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:49:06,719][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:49:07,040][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:49:07,361][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:49:07,682][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:49:08,003][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:49:08,324][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:49:08,646][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:49:08,968][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:49:09,289][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:49:09,611][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:49:09,933][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:49:10,254][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:49:10,576][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:49:10,897][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:49:11,220][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:49:11,541][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:49:11,864][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:49:12,186][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:49:12,509][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:49:12,831][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:49:13,153][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:49:13,475][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:49:13,796][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:49:14,119][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:49:14,440][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:49:14,762][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:49:15,083][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:49:15,404][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:49:15,726][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:49:16,046][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:49:16,367][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:49:16,688][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:49:17,009][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:49:17,331][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:49:17,654][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:49:17,975][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:49:18,296][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:49:18,919][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:49:19,240][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:49:19,562][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:49:19,884][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:49:20,205][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:49:20,526][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:49:20,847][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:49:21,168][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:49:21,490][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:49:21,812][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:49:22,134][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:49:22,456][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:49:22,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:49:23,445][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:49:24,193][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:49:24,195][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:49:24,197][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:49:24,854][__main__][INFO] - Iteration 156 took 27s (11.70% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 21m 18s. Estimated total time: 7h 35m 38s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 49s. [2026-03-25 16:49:24,857][__main__][INFO] - Starting iteration 156. [2026-03-25 16:49:24,860][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:49:24,861][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:49:28,049][__main__][INFO] - Number of regex retries in iteration 156: 0 [2026-03-25 16:49:28,050][__main__][INFO] - agents played in iteration 156 are Alice, Bob [2026-03-25 16:49:28,588][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:49:29,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:49:29,554][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:49:29,875][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:49:30,197][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:49:30,518][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:49:30,839][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:49:31,161][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:49:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:49:31,803][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:49:32,125][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:49:32,447][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:49:32,768][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:49:33,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:49:33,410][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:49:33,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:49:34,054][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:49:34,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:49:34,696][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:49:35,018][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:49:35,340][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:49:35,661][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:49:35,983][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:49:36,304][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:49:36,625][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:49:36,946][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:49:37,267][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:49:37,588][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:49:37,910][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:49:38,231][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:49:38,554][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:49:38,876][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:49:39,198][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:49:39,520][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:49:39,842][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:49:40,164][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:49:40,486][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:49:40,808][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:49:41,131][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:49:41,455][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:49:41,776][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:49:42,100][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:49:42,421][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:49:42,743][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:49:43,065][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:49:43,386][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:49:43,707][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:49:44,030][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:49:44,352][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:49:44,675][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:49:44,996][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:49:45,319][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:49:45,641][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:49:46,268][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:49:46,590][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:49:46,912][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:49:47,235][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:49:47,557][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:49:47,879][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:49:48,201][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:49:48,523][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:49:48,845][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:49:49,167][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:49:49,489][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:49:49,812][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:49:50,133][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:49:50,822][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:49:51,566][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:49:51,569][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:49:51,570][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:49:52,228][__main__][INFO] - Iteration 157 took 27s (11.65% Gen, 85.94% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 21m 20s. Estimated total time: 7h 36m 8s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 36s, 500 more iterations: 3h 48m 4s. [2026-03-25 16:49:52,230][__main__][INFO] - Starting iteration 157. [2026-03-25 16:49:52,233][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:49:52,234][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:49:55,429][__main__][INFO] - Number of regex retries in iteration 157: 0 [2026-03-25 16:49:55,430][__main__][INFO] - agents played in iteration 157 are Alice, Bob [2026-03-25 16:49:55,965][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:49:56,632][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:49:56,924][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:49:57,248][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:49:57,569][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:49:57,891][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:49:58,213][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:49:58,534][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:49:58,854][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:49:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:49:59,497][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:49:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:50:00,139][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:50:00,461][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:50:00,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:50:01,106][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:50:01,427][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:50:01,748][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:50:02,069][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:50:02,391][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:50:02,713][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:50:03,034][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:50:03,357][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:50:03,679][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:50:04,000][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:50:04,321][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:50:04,643][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:50:04,964][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:50:05,286][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:50:05,610][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:50:05,932][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:50:06,255][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:50:06,577][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:50:06,898][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:50:07,219][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:50:07,541][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:50:07,862][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:50:08,183][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:50:08,505][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:50:08,826][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:50:09,149][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:50:09,472][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:50:09,793][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:50:10,116][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:50:10,438][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:50:10,759][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:50:11,081][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:50:11,402][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:50:11,724][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:50:12,045][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:50:12,367][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:50:12,689][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:50:13,010][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:50:13,629][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:50:13,951][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:50:14,272][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:50:14,594][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:50:14,916][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:50:15,237][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:50:15,558][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:50:15,880][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:50:16,200][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:50:16,521][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:50:16,842][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:50:17,163][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:50:17,484][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:50:18,145][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:50:18,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:50:18,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:50:18,890][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:50:19,545][__main__][INFO] - Iteration 158 took 27s (11.70% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 19m 58s. Estimated total time: 7h 35m 13s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 36s. [2026-03-25 16:50:19,548][__main__][INFO] - Starting iteration 158. [2026-03-25 16:50:19,551][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:50:19,551][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:50:22,743][__main__][INFO] - Number of regex retries in iteration 158: 0 [2026-03-25 16:50:22,744][__main__][INFO] - agents played in iteration 158 are Alice, Bob [2026-03-25 16:50:23,280][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:50:23,953][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:50:24,245][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:50:24,567][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:50:24,888][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:50:25,210][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:50:25,530][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:50:25,852][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:50:26,173][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:50:26,495][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:50:26,816][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:50:27,137][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:50:27,459][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:50:27,781][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:50:28,102][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:50:28,423][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:50:28,744][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:50:29,065][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:50:29,386][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:50:29,707][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:50:30,029][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:50:30,350][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:50:30,672][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:50:30,994][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:50:31,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:50:31,637][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:50:31,958][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:50:32,280][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:50:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:50:32,922][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:50:33,244][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:50:33,564][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:50:33,885][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:50:34,206][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:50:34,526][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:50:34,847][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:50:35,168][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:50:35,490][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:50:35,811][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:50:36,132][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:50:36,460][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:50:36,781][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:50:37,101][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:50:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:50:37,744][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:50:38,065][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:50:38,386][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:50:38,707][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:50:39,029][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:50:39,349][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:50:39,670][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:50:39,992][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:50:40,313][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:50:40,936][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:50:41,257][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:50:41,579][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:50:41,902][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:50:42,223][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:50:42,544][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:50:42,865][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:50:43,186][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:50:43,507][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:50:43,829][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:50:44,151][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:50:44,472][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:50:44,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:50:45,462][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:50:46,213][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:50:46,215][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:50:46,217][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:50:46,872][__main__][INFO] - Iteration 159 took 27s (11.69% Gen, 85.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 19m 40s. Estimated total time: 7h 35m 22s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 32s, 500 more iterations: 3h 47m 41s. [2026-03-25 16:50:46,875][__main__][INFO] - Starting iteration 159. [2026-03-25 16:50:46,879][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:50:46,880][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:50:50,082][__main__][INFO] - Number of regex retries in iteration 159: 0 [2026-03-25 16:50:50,082][__main__][INFO] - agents played in iteration 159 are Alice, Bob [2026-03-25 16:50:50,624][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:50:51,296][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:50:51,588][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:50:51,910][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:50:52,231][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:50:52,553][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:50:52,875][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:50:53,196][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:50:53,517][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:50:53,837][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:50:54,158][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:50:54,480][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:50:54,801][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:50:55,123][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:50:55,444][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:50:55,766][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:50:56,088][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:50:56,411][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:50:56,734][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:50:57,056][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:50:57,378][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:50:57,699][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:50:58,021][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:50:58,343][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:50:58,665][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:50:58,987][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:50:59,310][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:50:59,632][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:50:59,954][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:51:00,277][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:51:00,600][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:51:00,922][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:51:01,244][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:51:01,566][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:51:01,889][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:51:02,212][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:51:02,533][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:51:02,855][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:51:03,178][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:51:03,500][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:51:03,821][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:51:04,144][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:51:04,465][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:51:04,786][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:51:05,108][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:51:05,429][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:51:05,752][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:51:06,074][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:51:06,396][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:51:06,717][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:51:07,037][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:51:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:51:07,681][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:51:08,303][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:51:08,625][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:51:08,946][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:51:09,268][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:51:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:51:09,910][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:51:10,232][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:51:10,553][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:51:10,875][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:51:11,196][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:51:11,518][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:51:11,840][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:51:12,162][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:51:12,828][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:51:13,566][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:51:13,569][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:51:13,570][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:51:14,226][__main__][INFO] - Iteration 160 took 27s (11.71% Gen, 85.88% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 19m 38s. Estimated total time: 7h 35m 48s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 34s, 500 more iterations: 3h 47m 54s. [2026-03-25 16:51:14,228][__main__][INFO] - Starting iteration 160. [2026-03-25 16:51:14,231][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:51:14,232][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:51:17,429][__main__][INFO] - Number of regex retries in iteration 160: 0 [2026-03-25 16:51:17,430][__main__][INFO] - agents played in iteration 160 are Alice, Bob [2026-03-25 16:51:17,968][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:51:18,641][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:51:18,934][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:51:19,256][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:51:19,578][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:51:19,899][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:51:20,220][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:51:20,542][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:51:20,864][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:51:21,185][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:51:21,506][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:51:21,828][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:51:22,150][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:51:22,471][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:51:22,792][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:51:23,113][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:51:23,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:51:23,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:51:24,080][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:51:24,403][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:51:24,724][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:51:25,045][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:51:25,367][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:51:25,689][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:51:26,010][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:51:26,332][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:51:26,654][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:51:26,976][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:51:27,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:51:27,621][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:51:27,943][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:51:28,264][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:51:28,585][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:51:28,908][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:51:29,230][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:51:29,552][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:51:29,875][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:51:30,196][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:51:30,517][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:51:30,839][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:51:31,160][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:51:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:51:31,805][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:51:32,127][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:51:32,449][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:51:32,770][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:51:33,092][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:51:33,413][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:51:33,735][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:51:34,056][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:51:34,378][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:51:34,699][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:51:35,020][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:51:35,649][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:51:35,972][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:51:36,294][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:51:36,615][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:51:36,937][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:51:37,259][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:51:37,581][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:51:37,903][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:51:38,224][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:51:38,546][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:51:38,868][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:51:39,189][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:51:39,510][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:51:40,170][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:51:40,913][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:51:40,916][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:51:40,917][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:51:41,567][__main__][INFO] - Iteration 161 took 27s (11.70% Gen, 85.92% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 18m 59s. Estimated total time: 7h 35m 36s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 48s. [2026-03-25 16:51:41,569][__main__][INFO] - Starting iteration 161. [2026-03-25 16:51:41,572][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:51:41,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:51:44,789][__main__][INFO] - Number of regex retries in iteration 161: 0 [2026-03-25 16:51:44,790][__main__][INFO] - agents played in iteration 161 are Alice, Bob [2026-03-25 16:51:45,341][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:51:46,012][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:51:46,304][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:51:46,626][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:51:46,948][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:51:47,269][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:51:47,590][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:51:47,911][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:51:48,233][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:51:48,554][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:51:48,877][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:51:49,198][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:51:49,519][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:51:49,842][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:51:50,164][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:51:50,485][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:51:50,806][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:51:51,127][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:51:51,448][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:51:51,770][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:51:52,092][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:51:52,414][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:51:52,736][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:51:53,058][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:51:53,379][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:51:53,701][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:51:54,024][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:51:54,345][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:51:54,667][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:51:54,988][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:51:55,311][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:51:55,632][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:51:55,954][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:51:56,276][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:51:56,598][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:51:56,919][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:51:57,242][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:51:57,563][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:51:57,884][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:51:58,206][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:51:58,528][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:51:58,849][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:51:59,171][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:51:59,494][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:51:59,816][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:52:00,138][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:52:00,459][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:52:00,781][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:52:01,103][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:52:01,425][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:52:01,746][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:52:02,067][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:52:02,389][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:52:03,010][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:52:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:52:03,654][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:52:03,978][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:52:04,300][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:52:04,620][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:52:04,941][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:52:05,262][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:52:05,583][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:52:05,906][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:52:06,227][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:52:06,548][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:52:06,869][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:52:07,534][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:52:08,275][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:52:08,277][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:52:08,279][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:52:08,929][__main__][INFO] - Iteration 162 took 27s (11.76% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 18m 53s. Estimated total time: 7h 35m 58s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 35s, 500 more iterations: 3h 47m 59s. [2026-03-25 16:52:08,931][__main__][INFO] - Starting iteration 162. [2026-03-25 16:52:08,935][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:52:08,935][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:52:12,188][__main__][INFO] - Number of regex retries in iteration 162: 0 [2026-03-25 16:52:12,189][__main__][INFO] - agents played in iteration 162 are Alice, Bob [2026-03-25 16:52:12,748][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:52:13,428][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:52:13,720][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:52:14,043][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:52:14,365][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:52:14,687][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:52:15,009][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:52:15,331][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:52:15,654][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:52:15,976][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:52:16,298][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:52:16,619][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:52:16,941][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:52:17,263][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:52:17,586][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:52:17,909][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:52:18,232][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:52:18,554][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:52:18,876][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:52:19,198][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:52:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:52:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:52:20,163][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:52:20,484][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:52:20,806][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:52:21,128][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:52:21,452][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:52:21,773][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:52:22,096][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:52:22,417][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:52:22,738][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:52:23,058][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:52:23,380][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:52:23,701][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:52:24,023][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:52:24,344][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:52:24,666][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:52:24,987][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:52:25,307][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:52:25,630][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:52:25,951][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:52:26,272][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:52:26,594][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:52:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:52:27,238][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:52:27,559][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:52:27,881][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:52:28,202][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:52:28,524][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:52:28,845][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:52:29,166][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:52:29,487][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:52:29,808][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:52:30,430][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:52:30,751][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:52:31,073][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:52:31,394][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:52:31,716][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:52:32,037][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:52:32,359][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:52:32,680][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:52:33,001][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:52:33,323][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:52:33,645][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:52:33,966][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:52:34,288][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:52:34,953][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:52:35,690][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:52:35,692][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:52:35,694][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:52:36,345][__main__][INFO] - Iteration 163 took 27s (11.87% Gen, 85.75% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 19m 19s. Estimated total time: 7h 36m 51s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 25s. [2026-03-25 16:52:36,347][__main__][INFO] - Starting iteration 163. [2026-03-25 16:52:36,351][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:52:36,351][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:52:39,578][__main__][INFO] - Number of regex retries in iteration 163: 0 [2026-03-25 16:52:39,579][__main__][INFO] - agents played in iteration 163 are Alice, Bob [2026-03-25 16:52:40,115][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:52:40,786][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:52:41,078][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:52:41,400][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:52:41,722][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:52:42,043][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:52:42,364][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:52:42,685][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:52:43,006][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:52:43,327][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:52:43,648][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:52:43,970][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:52:44,291][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:52:44,614][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:52:44,936][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:52:45,257][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:52:45,578][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:52:45,899][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:52:46,220][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:52:46,541][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:52:46,862][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:52:47,184][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:52:47,506][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:52:47,827][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:52:48,150][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:52:48,471][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:52:48,792][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:52:49,114][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:52:49,436][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:52:49,757][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:52:50,078][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:52:50,400][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:52:50,722][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:52:51,043][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:52:51,365][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:52:51,687][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:52:52,008][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:52:52,331][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:52:52,652][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:52:52,974][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:52:53,296][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:52:53,617][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:52:53,939][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:52:54,260][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:52:54,583][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:52:54,905][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:52:55,227][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:52:55,549][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:52:55,870][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:52:56,192][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:52:56,515][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:52:56,836][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:52:57,158][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:52:57,781][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:52:58,102][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:52:58,424][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:52:58,746][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:52:59,068][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:52:59,389][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:52:59,711][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:53:00,033][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:53:00,354][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:53:00,676][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:53:00,997][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:53:01,319][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:53:01,641][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:53:02,305][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:53:03,045][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:53:03,047][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:53:03,049][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:53:03,723][__main__][INFO] - Iteration 164 took 27s (11.79% Gen, 85.74% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 18m 13s. Estimated total time: 7h 36m 13s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 37s, 500 more iterations: 3h 48m 6s. [2026-03-25 16:53:03,725][__main__][INFO] - Starting iteration 164. [2026-03-25 16:53:03,728][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:53:03,729][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:53:06,949][__main__][INFO] - Number of regex retries in iteration 164: 0 [2026-03-25 16:53:06,950][__main__][INFO] - agents played in iteration 164 are Alice, Bob [2026-03-25 16:53:07,487][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:53:08,159][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:53:08,452][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:53:08,774][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:53:09,096][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:53:09,417][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:53:09,738][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:53:10,059][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:53:10,381][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:53:10,703][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:53:11,024][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:53:11,345][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:53:11,666][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:53:11,988][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:53:12,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:53:12,631][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:53:12,953][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:53:13,274][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:53:13,595][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:53:13,917][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:53:14,239][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:53:14,560][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:53:14,881][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:53:15,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:53:15,524][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:53:15,844][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:53:16,165][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:53:16,487][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:53:16,809][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:53:17,130][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:53:17,452][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:53:17,773][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:53:18,095][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:53:18,417][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:53:18,738][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:53:19,059][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:53:19,381][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:53:19,703][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:53:20,025][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:53:20,346][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:53:20,667][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:53:20,989][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:53:21,311][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:53:21,633][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:53:21,954][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:53:22,276][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:53:22,597][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:53:22,920][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:53:23,242][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:53:23,564][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:53:23,885][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:53:24,208][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:53:24,529][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:53:25,150][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:53:25,473][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:53:25,795][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:53:26,117][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:53:26,439][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:53:26,762][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:53:27,086][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:53:27,408][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:53:27,729][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:53:28,052][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:53:28,374][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:53:28,697][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:53:29,019][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:53:29,693][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:53:30,430][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:53:30,432][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:53:30,434][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:53:31,081][__main__][INFO] - Iteration 165 took 27s (11.77% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 17m 28s. Estimated total time: 7h 35m 54s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 35s, 500 more iterations: 3h 47m 57s. [2026-03-25 16:53:31,084][__main__][INFO] - Starting iteration 165. [2026-03-25 16:53:31,087][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:53:31,087][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:53:34,301][__main__][INFO] - Number of regex retries in iteration 165: 0 [2026-03-25 16:53:34,302][__main__][INFO] - agents played in iteration 165 are Alice, Bob [2026-03-25 16:53:34,841][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:53:35,533][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:53:35,825][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:53:36,147][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:53:36,469][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:53:36,790][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:53:37,112][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:53:37,434][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:53:37,756][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:53:38,077][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:53:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:53:38,720][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:53:39,041][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:53:39,362][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:53:39,683][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:53:40,004][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:53:40,325][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:53:40,646][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:53:40,967][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:53:41,290][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:53:41,612][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:53:41,934][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:53:42,256][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:53:42,578][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:53:42,899][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:53:43,220][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:53:43,542][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:53:43,864][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:53:44,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:53:44,506][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:53:44,828][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:53:45,149][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:53:45,470][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:53:45,792][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:53:46,114][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:53:46,436][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:53:46,756][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:53:47,078][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:53:47,399][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:53:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:53:48,044][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:53:48,365][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:53:48,686][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:53:49,008][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:53:49,330][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:53:49,652][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:53:49,973][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:53:50,295][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:53:50,617][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:53:50,939][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:53:51,260][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:53:51,581][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:53:51,903][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:53:52,524][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:53:52,845][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:53:53,166][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:53:53,487][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:53:53,808][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:53:54,129][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:53:54,451][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:53:54,772][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:53:55,094][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:53:55,416][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:53:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:53:56,059][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:53:56,381][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:53:57,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:53:57,787][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:53:57,789][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:53:57,791][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:53:58,439][__main__][INFO] - Iteration 166 took 27s (11.75% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 16m 59s. Estimated total time: 7h 35m 53s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 35s, 500 more iterations: 3h 47m 56s. [2026-03-25 16:53:58,441][__main__][INFO] - Starting iteration 166. [2026-03-25 16:53:58,445][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:53:58,446][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:54:01,658][__main__][INFO] - Number of regex retries in iteration 166: 0 [2026-03-25 16:54:01,658][__main__][INFO] - agents played in iteration 166 are Alice, Bob [2026-03-25 16:54:02,224][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:54:02,894][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:54:03,187][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:54:03,508][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:54:03,829][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:54:04,151][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:54:04,473][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:54:04,794][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:54:05,116][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:54:05,438][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:54:05,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:54:06,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:54:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:54:06,723][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:54:07,045][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:54:07,367][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:54:07,689][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:54:08,011][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:54:08,332][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:54:08,654][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:54:08,975][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:54:09,297][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:54:09,617][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:54:09,939][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:54:10,260][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:54:10,582][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:54:10,903][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:54:11,225][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:54:11,546][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:54:11,868][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:54:12,189][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:54:12,512][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:54:12,834][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:54:13,155][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:54:13,476][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:54:13,799][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:54:14,120][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:54:14,442][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:54:14,764][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:54:15,086][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:54:15,408][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:54:15,731][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:54:16,051][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:54:16,372][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:54:16,695][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:54:17,016][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:54:17,338][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:54:17,660][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:54:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:54:18,305][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:54:18,627][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:54:18,949][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:54:19,269][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:54:19,890][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:54:20,211][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:54:20,532][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:54:20,854][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:54:21,176][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:54:21,497][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:54:21,820][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:54:22,142][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:54:22,464][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:54:22,786][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:54:23,108][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:54:23,430][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:54:23,752][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:54:24,412][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:54:25,153][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:54:25,155][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:54:25,157][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:54:25,801][__main__][INFO] - Iteration 167 took 27s (11.74% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 16m 36s. Estimated total time: 7h 35m 57s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 35s, 500 more iterations: 3h 47m 58s. [2026-03-25 16:54:25,803][__main__][INFO] - Starting iteration 167. [2026-03-25 16:54:25,806][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:54:25,807][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:54:29,042][__main__][INFO] - Number of regex retries in iteration 167: 0 [2026-03-25 16:54:29,043][__main__][INFO] - agents played in iteration 167 are Alice, Bob [2026-03-25 16:54:29,635][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:54:30,301][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:54:30,594][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:54:30,915][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:54:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:54:31,559][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:54:31,881][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:54:32,203][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:54:32,525][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:54:32,847][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:54:33,169][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:54:33,491][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:54:33,812][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:54:34,135][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:54:34,457][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:54:34,778][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:54:35,100][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:54:35,423][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:54:35,746][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:54:36,068][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:54:36,390][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:54:36,712][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:54:37,035][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:54:37,357][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:54:37,678][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:54:38,000][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:54:38,323][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:54:38,645][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:54:38,966][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:54:39,288][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:54:39,609][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:54:39,931][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:54:40,254][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:54:40,576][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:54:40,897][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:54:41,219][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:54:41,541][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:54:41,864][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:54:42,187][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:54:42,509][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:54:42,831][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:54:43,155][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:54:43,478][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:54:43,802][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:54:44,124][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:54:44,446][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:54:44,769][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:54:45,093][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:54:45,416][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:54:45,739][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:54:46,063][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:54:46,386][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:54:46,709][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:54:47,332][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:54:47,655][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:54:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:54:48,301][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:54:48,623][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:54:48,947][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:54:49,269][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:54:49,592][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:54:49,914][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:54:50,236][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:54:50,559][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:54:50,881][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:54:51,205][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:54:51,865][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:54:52,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:54:52,606][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:54:52,608][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:54:53,253][__main__][INFO] - Iteration 168 took 27s (11.79% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 17m 38s. Estimated total time: 7h 37m 27s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 43s. [2026-03-25 16:54:53,255][__main__][INFO] - Starting iteration 168. [2026-03-25 16:54:53,258][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:54:53,259][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:54:56,482][__main__][INFO] - Number of regex retries in iteration 168: 0 [2026-03-25 16:54:56,483][__main__][INFO] - agents played in iteration 168 are Alice, Bob [2026-03-25 16:54:57,057][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:54:57,723][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:54:58,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:54:58,337][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:54:58,658][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:54:58,980][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:54:59,302][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:54:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:54:59,945][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:55:00,267][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:55:00,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:55:00,910][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:55:01,232][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:55:01,555][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:55:01,877][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:55:02,199][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:55:02,520][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:55:02,842][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:55:03,164][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:55:03,486][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:55:03,808][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:55:04,129][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:55:04,451][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:55:04,774][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:55:05,096][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:55:05,418][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:55:05,739][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:55:06,060][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:55:06,382][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:55:06,704][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:55:07,024][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:55:07,346][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:55:07,668][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:55:07,989][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:55:08,311][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:55:08,633][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:55:08,955][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:55:09,277][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:55:09,600][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:55:09,923][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:55:10,244][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:55:10,566][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:55:10,887][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:55:11,209][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:55:11,530][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:55:11,852][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:55:12,174][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:55:12,497][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:55:12,819][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:55:13,140][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:55:13,461][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:55:13,783][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:55:14,104][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:55:14,723][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:55:15,045][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:55:15,366][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:55:15,688][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:55:16,009][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:55:16,331][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:55:16,652][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:55:16,973][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:55:17,295][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:55:17,617][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:55:17,939][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:55:18,261][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:55:18,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:55:19,244][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:55:19,991][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:55:19,993][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:55:19,995][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:55:20,640][__main__][INFO] - Iteration 169 took 27s (11.78% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 16m 7s. Estimated total time: 7h 36m 23s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 11s. [2026-03-25 16:55:20,643][__main__][INFO] - Starting iteration 169. [2026-03-25 16:55:20,646][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:55:20,646][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:55:23,881][__main__][INFO] - Number of regex retries in iteration 169: 0 [2026-03-25 16:55:23,882][__main__][INFO] - agents played in iteration 169 are Alice, Bob [2026-03-25 16:55:24,468][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:55:25,135][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:55:25,426][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:55:25,749][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:55:26,071][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:55:26,393][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:55:26,715][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:55:27,037][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:55:27,359][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:55:27,681][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:55:28,002][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:55:28,324][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:55:28,646][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:55:28,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:55:29,290][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:55:29,612][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:55:29,933][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:55:30,255][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:55:30,577][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:55:30,898][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:55:31,220][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:55:31,541][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:55:31,864][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:55:32,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:55:32,507][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:55:32,829][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:55:33,151][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:55:33,472][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:55:33,795][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:55:34,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:55:34,440][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:55:34,762][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:55:35,083][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:55:35,405][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:55:35,727][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:55:36,049][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:55:36,370][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:55:36,692][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:55:37,015][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:55:37,336][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:55:37,658][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:55:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:55:38,301][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:55:38,622][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:55:38,944][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:55:39,266][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:55:39,588][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:55:39,910][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:55:40,231][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:55:40,553][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:55:40,874][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:55:41,196][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:55:41,517][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:55:42,137][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:55:42,460][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:55:42,781][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:55:43,102][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:55:43,424][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:55:43,745][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:55:44,068][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:55:44,390][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:55:44,711][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:55:45,033][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:55:45,356][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:55:45,677][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:55:45,999][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:55:46,659][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:55:47,401][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:55:47,403][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:55:47,405][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:55:48,049][__main__][INFO] - Iteration 170 took 27s (11.81% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 16m 1s. Estimated total time: 7h 36m 44s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 22s. [2026-03-25 16:55:48,052][__main__][INFO] - Starting iteration 170. [2026-03-25 16:55:48,055][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:55:48,056][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:55:51,278][__main__][INFO] - Number of regex retries in iteration 170: 0 [2026-03-25 16:55:51,279][__main__][INFO] - agents played in iteration 170 are Alice, Bob [2026-03-25 16:55:51,851][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:55:52,517][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:55:52,809][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:55:53,131][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:55:53,452][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:55:53,775][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:55:54,096][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:55:54,417][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:55:54,738][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:55:55,059][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:55:55,381][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:55:55,703][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:55:56,024][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:55:56,345][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:55:56,667][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:55:56,990][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:55:57,312][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:55:57,635][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:55:57,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:55:58,279][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:55:58,601][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:55:58,923][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:55:59,245][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:55:59,567][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:55:59,889][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:56:00,212][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:56:00,535][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:56:00,857][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:56:01,179][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:56:01,502][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:56:01,824][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:56:02,147][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:56:02,469][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:56:02,792][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:56:03,115][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:56:03,437][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:56:03,759][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:56:04,080][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:56:04,403][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:56:04,725][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:56:05,046][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:56:05,368][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:56:05,690][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:56:06,012][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:56:06,335][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:56:06,657][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:56:06,979][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:56:07,301][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:56:07,622][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:56:07,943][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:56:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:56:08,587][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:56:08,908][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:56:09,527][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:56:09,848][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:56:10,169][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:56:10,491][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:56:10,813][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:56:11,134][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:56:11,456][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:56:11,778][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:56:12,101][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:56:12,422][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:56:12,745][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:56:13,067][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:56:13,388][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:56:14,048][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:56:14,787][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:56:14,789][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:56:14,791][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:56:15,437][__main__][INFO] - Iteration 171 took 27s (11.77% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 15m 12s. Estimated total time: 7h 36m 23s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 11s. [2026-03-25 16:56:15,439][__main__][INFO] - Starting iteration 171. [2026-03-25 16:56:15,442][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:56:15,443][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:56:18,674][__main__][INFO] - Number of regex retries in iteration 171: 0 [2026-03-25 16:56:18,675][__main__][INFO] - agents played in iteration 171 are Alice, Bob [2026-03-25 16:56:19,216][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:56:19,883][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:56:20,175][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:56:20,497][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:56:20,819][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:56:21,141][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:56:21,463][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:56:21,784][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:56:22,106][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:56:22,427][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:56:22,750][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:56:23,072][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:56:23,395][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:56:23,717][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:56:24,038][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:56:24,360][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:56:24,681][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:56:25,003][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:56:25,325][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:56:25,647][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:56:25,968][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:56:26,291][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:56:26,613][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:56:26,935][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:56:27,257][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:56:27,579][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:56:27,901][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:56:28,222][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:56:28,544][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:56:28,866][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:56:29,187][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:56:29,509][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:56:29,832][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:56:30,154][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:56:30,475][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:56:30,797][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:56:31,118][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:56:31,440][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:56:31,762][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:56:32,084][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:56:32,406][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:56:32,727][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:56:33,049][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:56:33,371][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:56:33,693][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:56:34,015][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:56:34,337][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:56:34,658][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:56:34,980][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:56:35,301][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:56:35,623][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:56:35,944][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:56:36,266][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:56:36,887][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:56:37,209][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:56:37,531][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:56:37,853][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:56:38,174][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:56:38,496][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:56:38,818][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:56:39,141][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:56:39,462][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:56:39,784][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:56:40,106][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:56:40,427][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:56:40,750][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:56:41,415][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:56:42,168][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:56:42,170][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:56:42,172][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:56:42,811][__main__][INFO] - Iteration 172 took 27s (11.81% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 14m 31s. Estimated total time: 7h 36m 9s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 36s, 500 more iterations: 3h 48m 4s. [2026-03-25 16:56:42,813][__main__][INFO] - Starting iteration 172. [2026-03-25 16:56:42,817][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:56:42,817][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:56:46,052][__main__][INFO] - Number of regex retries in iteration 172: 0 [2026-03-25 16:56:46,053][__main__][INFO] - agents played in iteration 172 are Alice, Bob [2026-03-25 16:56:46,595][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:56:47,266][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:56:47,557][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:56:47,879][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:56:48,201][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:56:48,522][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:56:48,843][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:56:49,164][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:56:49,486][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:56:49,807][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:56:50,129][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:56:50,451][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:56:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:56:51,097][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:56:51,418][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:56:51,740][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:56:52,063][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:56:52,386][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:56:52,707][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:56:53,028][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:56:53,349][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:56:53,672][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:56:53,994][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:56:54,316][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:56:54,638][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:56:54,959][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:56:55,280][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:56:55,601][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:56:55,922][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:56:56,243][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:56:56,564][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:56:56,886][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:56:57,208][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:56:57,530][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:56:57,852][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:56:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:56:58,496][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:56:58,818][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:56:59,139][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:56:59,460][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:56:59,781][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:57:00,103][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:57:00,424][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:57:00,746][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:57:01,067][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:57:01,388][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:57:01,709][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:57:02,030][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:57:02,351][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:57:02,673][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:57:02,996][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:57:03,317][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:57:03,639][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:57:04,263][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:57:04,586][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:57:04,909][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:57:05,230][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:57:05,552][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:57:05,874][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:57:06,195][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:57:06,517][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:57:06,838][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:57:07,161][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:57:07,482][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:57:07,804][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:57:08,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:57:08,795][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:57:09,539][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:57:09,541][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:57:09,543][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:57:10,199][__main__][INFO] - Iteration 173 took 27s (11.82% Gen, 85.78% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 14m 18s. Estimated total time: 7h 36m 23s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 11s. [2026-03-25 16:57:10,202][__main__][INFO] - Starting iteration 173. [2026-03-25 16:57:10,205][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:57:10,205][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:57:13,481][__main__][INFO] - Number of regex retries in iteration 173: 0 [2026-03-25 16:57:13,482][__main__][INFO] - agents played in iteration 173 are Alice, Bob [2026-03-25 16:57:14,027][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:57:14,710][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:57:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:57:15,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:57:15,647][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:57:15,969][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:57:16,290][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:57:16,612][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:57:16,934][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:57:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:57:17,579][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:57:17,902][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:57:18,225][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:57:18,546][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:57:18,868][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:57:19,190][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:57:19,512][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:57:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:57:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:57:20,481][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:57:20,804][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:57:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:57:21,447][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:57:21,769][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:57:22,090][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:57:22,412][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:57:22,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:57:23,059][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:57:23,380][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:57:23,702][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:57:24,024][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:57:24,346][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:57:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:57:24,988][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:57:25,309][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:57:25,629][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:57:25,950][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:57:26,272][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:57:26,594][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:57:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:57:27,239][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:57:27,560][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:57:27,882][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:57:28,204][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:57:28,526][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:57:28,847][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:57:29,168][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:57:29,489][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:57:29,812][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:57:30,134][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:57:30,455][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:57:30,777][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:57:31,098][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:57:31,722][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:57:32,045][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:57:32,366][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:57:32,687][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:57:33,008][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:57:33,329][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:57:33,651][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:57:33,972][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:57:34,294][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:57:34,616][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:57:34,938][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:57:35,260][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:57:35,583][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:57:36,263][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:57:37,017][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:57:37,019][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:57:37,021][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:57:37,680][__main__][INFO] - Iteration 174 took 27s (11.92% Gen, 85.67% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 15m 23s. Estimated total time: 7h 37m 56s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 47s, 500 more iterations: 3h 48m 58s. [2026-03-25 16:57:37,683][__main__][INFO] - Starting iteration 174. [2026-03-25 16:57:37,686][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:57:37,686][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:57:40,906][__main__][INFO] - Number of regex retries in iteration 174: 0 [2026-03-25 16:57:40,907][__main__][INFO] - agents played in iteration 174 are Alice, Bob [2026-03-25 16:57:41,444][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:57:42,119][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:57:42,410][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:57:42,733][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:57:43,055][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:57:43,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:57:43,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:57:44,019][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:57:44,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:57:44,662][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:57:44,984][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:57:45,305][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:57:45,627][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:57:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:57:46,270][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:57:46,591][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:57:46,913][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:57:47,234][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:57:47,556][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:57:47,878][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:57:48,199][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:57:48,521][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:57:48,842][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:57:49,164][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:57:49,485][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:57:49,807][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:57:50,129][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:57:50,450][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:57:50,772][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:57:51,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:57:51,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:57:51,737][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:57:52,058][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:57:52,380][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:57:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:57:53,022][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:57:53,344][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:57:53,666][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:57:53,987][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:57:54,309][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:57:54,631][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:57:54,952][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:57:55,273][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:57:55,595][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:57:55,917][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:57:56,238][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:57:56,560][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:57:56,882][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:57:57,205][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:57:57,526][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:57:57,848][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:57:58,170][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:57:58,491][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:57:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:57:59,437][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:57:59,758][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:58:00,079][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:58:00,400][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:58:00,721][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:58:01,043][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:58:01,364][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:58:01,686][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:58:02,007][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:58:02,329][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:58:02,651][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:58:02,973][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:58:03,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:58:04,530][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:58:04,533][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:58:04,534][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:58:05,163][__main__][INFO] - Iteration 175 took 27s (11.72% Gen, 85.98% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 14m 57s. Estimated total time: 7h 37m 58s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 47s, 500 more iterations: 3h 48m 59s. [2026-03-25 16:58:05,165][__main__][INFO] - Starting iteration 175. [2026-03-25 16:58:05,169][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:58:05,169][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:58:08,400][__main__][INFO] - Number of regex retries in iteration 175: 0 [2026-03-25 16:58:08,401][__main__][INFO] - agents played in iteration 175 are Alice, Bob [2026-03-25 16:58:08,939][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:58:09,614][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:58:09,906][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:58:10,227][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:58:10,549][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:58:10,871][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:58:11,192][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:58:11,515][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:58:11,837][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:58:12,158][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:58:12,479][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:58:12,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:58:13,121][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:58:13,443][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:58:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:58:14,087][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:58:14,409][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:58:14,731][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:58:15,053][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:58:15,375][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:58:15,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:58:16,020][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:58:16,343][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:58:16,664][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:58:16,986][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:58:17,308][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:58:17,631][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:58:17,953][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:58:18,277][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:58:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:58:18,920][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:58:19,242][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:58:19,563][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:58:19,884][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:58:20,206][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:58:20,528][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:58:20,850][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:58:21,171][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:58:21,493][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:58:21,815][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:58:22,138][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:58:22,460][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:58:22,782][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:58:23,103][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:58:23,424][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:58:23,746][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:58:24,068][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:58:24,389][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:58:24,711][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:58:25,032][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:58:25,353][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:58:25,675][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:58:25,996][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:58:26,620][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:58:26,942][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:58:27,265][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:58:27,588][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:58:27,910][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:58:28,232][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:58:28,554][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:58:28,876][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:58:29,199][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:58:29,520][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:58:29,843][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:58:30,166][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:58:30,488][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:58:31,166][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:58:31,917][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:58:31,919][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:58:31,921][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:58:32,582][__main__][INFO] - Iteration 176 took 27s (11.79% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 13m 26s. Estimated total time: 7h 36m 54s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 41s, 500 more iterations: 3h 48m 27s. [2026-03-25 16:58:32,584][__main__][INFO] - Starting iteration 176. [2026-03-25 16:58:32,587][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:58:32,588][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:58:35,834][__main__][INFO] - Number of regex retries in iteration 176: 0 [2026-03-25 16:58:35,835][__main__][INFO] - agents played in iteration 176 are Alice, Bob [2026-03-25 16:58:36,379][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:58:37,057][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:58:37,384][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:58:37,712][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:58:38,035][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:58:38,358][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:58:38,679][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:58:39,002][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:58:39,323][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:58:39,645][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:58:39,966][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:58:40,288][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:58:40,609][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:58:40,931][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:58:41,252][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:58:41,574][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:58:41,896][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:58:42,218][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:58:42,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:58:42,861][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:58:43,183][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:58:43,504][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:58:43,826][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:58:44,148][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:58:44,470][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:58:44,792][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:58:45,113][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:58:45,435][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:58:45,757][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:58:46,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:58:46,402][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:58:46,724][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:58:47,045][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:58:47,366][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:58:47,687][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:58:48,008][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:58:48,330][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:58:48,651][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:58:48,976][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:58:49,298][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:58:49,619][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:58:49,942][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:58:50,264][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:58:50,586][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:58:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:58:51,229][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:58:51,552][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:58:51,873][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:58:52,195][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:58:52,517][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:58:52,838][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:58:53,159][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:58:53,480][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:58:54,104][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:58:54,427][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:58:54,749][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:58:55,072][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:58:55,395][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:58:55,717][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:58:56,039][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:58:56,360][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:58:56,681][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:58:57,004][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:58:57,327][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:58:57,649][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:58:57,970][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:58:58,640][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:58:59,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:58:59,467][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:58:59,468][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:59:00,106][__main__][INFO] - Iteration 177 took 27s (11.80% Gen, 85.88% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 14m 44s. Estimated total time: 7h 38m 40s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 52s, 500 more iterations: 3h 49m 20s. [2026-03-25 16:59:00,108][__main__][INFO] - Starting iteration 177. [2026-03-25 16:59:00,111][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:59:00,112][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:59:03,359][__main__][INFO] - Number of regex retries in iteration 177: 0 [2026-03-25 16:59:03,360][__main__][INFO] - agents played in iteration 177 are Alice, Bob [2026-03-25 16:59:03,924][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:59:04,607][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:59:04,900][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:59:05,221][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:59:05,543][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:59:05,865][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:59:06,186][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:59:06,509][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:59:06,831][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:59:07,152][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:59:07,474][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:59:07,796][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:59:08,118][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:59:08,440][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:59:08,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:59:09,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:59:09,405][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:59:09,727][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:59:10,048][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:59:10,370][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:59:10,692][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:59:11,014][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:59:11,336][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:59:11,658][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:59:11,980][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:59:12,303][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:59:12,625][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:59:12,947][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:59:13,268][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:59:13,590][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:59:13,911][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:59:14,232][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:59:14,554][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:59:14,876][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:59:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:59:15,520][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:59:15,841][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:59:16,163][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:59:16,485][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:59:16,807][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:59:17,128][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:59:17,449][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:59:17,770][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:59:18,092][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:59:18,414][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:59:18,735][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:59:19,057][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:59:19,379][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:59:19,700][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:59:20,021][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:59:20,343][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:59:20,665][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:59:20,987][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:59:21,612][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:59:21,934][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:59:22,255][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:59:22,577][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:59:22,899][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:59:23,221][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:59:23,542][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:59:23,863][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:59:24,184][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:59:24,506][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:59:24,828][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:59:25,149][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:59:25,471][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:59:26,140][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:59:26,886][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:59:26,888][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:59:26,890][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:59:27,546][__main__][INFO] - Iteration 178 took 27s (11.84% Gen, 85.76% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 12m 53s. Estimated total time: 7h 37m 16s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 43s, 500 more iterations: 3h 48m 38s. [2026-03-25 16:59:27,549][__main__][INFO] - Starting iteration 178. [2026-03-25 16:59:27,552][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:59:27,553][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:59:30,803][__main__][INFO] - Number of regex retries in iteration 178: 0 [2026-03-25 16:59:30,804][__main__][INFO] - agents played in iteration 178 are Alice, Bob [2026-03-25 16:59:31,377][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:59:32,054][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:59:32,345][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 16:59:32,667][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 16:59:32,989][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 16:59:33,310][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 16:59:33,632][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 16:59:33,953][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 16:59:34,273][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 16:59:34,595][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 16:59:34,917][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 16:59:35,238][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 16:59:35,559][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 16:59:35,881][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 16:59:36,202][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 16:59:36,524][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 16:59:36,845][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 16:59:37,166][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 16:59:37,487][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 16:59:37,809][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 16:59:38,131][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 16:59:38,452][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 16:59:38,773][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 16:59:39,095][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 16:59:39,417][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 16:59:39,740][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 16:59:40,062][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 16:59:40,383][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 16:59:40,705][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 16:59:41,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 16:59:41,347][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 16:59:41,669][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 16:59:41,991][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 16:59:42,313][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 16:59:42,636][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 16:59:42,958][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 16:59:43,279][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 16:59:43,601][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 16:59:43,923][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 16:59:44,244][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 16:59:44,566][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 16:59:44,889][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 16:59:45,212][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 16:59:45,534][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 16:59:45,857][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 16:59:46,180][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 16:59:46,502][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 16:59:46,825][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 16:59:47,148][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 16:59:47,471][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 16:59:47,793][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 16:59:48,115][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 16:59:48,438][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 16:59:49,068][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 16:59:49,390][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 16:59:49,713][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 16:59:50,037][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 16:59:50,359][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 16:59:50,681][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 16:59:51,004][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 16:59:51,325][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 16:59:51,647][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 16:59:51,971][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 16:59:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 16:59:52,616][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 16:59:52,938][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 16:59:53,609][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 16:59:54,349][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 16:59:54,351][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 16:59:54,353][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 16:59:55,038][__main__][INFO] - Iteration 179 took 27s (11.83% Gen, 85.67% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 13m 16s. Estimated total time: 7h 38m 7s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 48s, 500 more iterations: 3h 49m 3s. [2026-03-25 16:59:55,040][__main__][INFO] - Starting iteration 179. [2026-03-25 16:59:55,044][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 16:59:55,044][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 16:59:58,306][__main__][INFO] - Number of regex retries in iteration 179: 0 [2026-03-25 16:59:58,307][__main__][INFO] - agents played in iteration 179 are Alice, Bob [2026-03-25 16:59:58,877][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 16:59:59,553][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 16:59:59,845][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:00:00,167][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:00:00,488][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:00:00,809][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:00:01,131][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:00:01,452][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:00:01,774][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:00:02,096][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:00:02,418][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:00:02,740][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:00:03,061][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:00:03,383][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:00:03,704][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:00:04,025][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:00:04,347][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:00:04,668][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:00:04,990][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:00:05,313][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:00:05,634][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:00:05,956][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:00:06,278][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:00:06,599][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:00:06,920][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:00:07,240][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:00:07,562][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:00:07,883][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:00:08,205][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:00:08,527][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:00:08,847][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:00:09,169][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:00:09,490][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:00:09,811][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:00:10,133][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:00:10,455][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:00:10,776][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:00:11,098][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:00:11,419][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:00:11,742][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:00:12,064][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:00:12,386][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:00:12,709][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:00:13,030][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:00:13,351][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:00:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:00:13,995][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:00:14,316][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:00:14,638][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:00:14,960][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:00:15,281][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:00:15,602][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:00:15,923][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:00:16,548][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:00:16,869][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:00:17,189][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:00:17,511][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:00:17,833][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:00:18,155][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:00:18,476][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:00:18,798][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:00:19,120][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:00:19,441][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:00:19,763][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:00:20,083][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:00:20,404][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:00:21,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:00:21,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:00:21,814][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:00:21,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:00:22,467][__main__][INFO] - Iteration 180 took 27s (11.89% Gen, 85.72% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 11m 46s. Estimated total time: 7h 37m 4s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 42s, 500 more iterations: 3h 48m 32s. [2026-03-25 17:00:22,470][__main__][INFO] - Starting iteration 180. [2026-03-25 17:00:22,473][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:00:22,473][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:00:25,672][__main__][INFO] - Number of regex retries in iteration 180: 0 [2026-03-25 17:00:25,673][__main__][INFO] - agents played in iteration 180 are Alice, Bob [2026-03-25 17:00:26,240][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:00:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:00:27,209][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:00:27,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:00:27,853][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:00:28,174][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:00:28,496][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:00:28,817][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:00:29,140][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:00:29,461][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:00:29,783][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:00:30,106][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:00:30,428][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:00:30,749][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:00:31,070][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:00:31,392][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:00:31,713][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:00:32,035][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:00:32,358][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:00:32,680][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:00:33,003][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:00:33,324][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:00:33,646][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:00:33,968][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:00:34,289][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:00:34,610][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:00:34,931][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:00:35,253][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:00:35,576][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:00:35,900][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:00:36,222][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:00:36,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:00:36,866][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:00:37,187][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:00:37,509][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:00:37,830][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:00:38,151][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:00:38,473][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:00:38,796][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:00:39,118][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:00:39,440][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:00:39,762][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:00:40,084][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:00:40,406][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:00:40,727][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:00:41,048][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:00:41,369][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:00:41,691][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:00:42,013][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:00:42,335][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:00:42,656][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:00:42,979][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:00:43,300][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:00:43,924][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:00:44,246][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:00:44,568][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:00:44,890][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:00:45,212][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:00:45,534][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:00:45,856][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:00:46,178][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:00:46,500][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:00:46,821][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:00:47,142][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:00:47,464][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:00:47,785][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:00:48,453][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:00:49,194][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:00:49,197][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:00:49,198][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:00:49,853][__main__][INFO] - Iteration 181 took 27s (11.69% Gen, 85.92% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 10m 36s. Estimated total time: 7h 36m 21s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 38s, 500 more iterations: 3h 48m 10s. [2026-03-25 17:00:49,855][__main__][INFO] - Starting iteration 181. [2026-03-25 17:00:49,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:00:49,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:00:53,046][__main__][INFO] - Number of regex retries in iteration 181: 0 [2026-03-25 17:00:53,047][__main__][INFO] - agents played in iteration 181 are Alice, Bob [2026-03-25 17:00:53,617][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:00:54,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:00:54,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:00:54,906][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:00:55,229][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:00:55,552][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:00:55,874][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:00:56,195][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:00:56,517][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:00:56,839][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:00:57,161][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:00:57,485][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:00:57,806][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:00:58,129][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:00:58,451][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:00:58,774][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:00:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:00:59,420][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:00:59,742][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:01:00,064][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:01:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:01:00,709][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:01:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:01:01,353][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:01:01,676][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:01:01,999][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:01:02,321][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:01:02,643][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:01:02,965][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:01:03,288][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:01:03,611][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:01:03,933][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:01:04,256][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:01:04,578][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:01:04,901][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:01:05,224][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:01:05,546][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:01:05,869][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:01:06,190][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:01:06,512][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:01:06,834][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:01:07,157][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:01:07,480][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:01:07,803][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:01:08,126][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:01:08,448][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:01:08,769][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:01:09,090][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:01:09,411][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:01:09,732][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:01:10,054][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:01:10,376][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:01:10,698][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:01:11,322][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:01:11,644][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:01:11,966][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:01:12,288][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:01:12,610][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:01:12,931][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:01:13,252][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:01:13,574][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:01:13,895][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:01:14,217][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:01:14,539][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:01:14,861][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:01:15,183][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:01:15,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:01:16,603][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:01:16,605][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:01:16,607][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:01:17,313][__main__][INFO] - Iteration 182 took 27s (11.61% Gen, 85.81% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 11m 22s. Estimated total time: 7h 37m 35s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 45s, 500 more iterations: 3h 48m 47s. [2026-03-25 17:01:17,315][__main__][INFO] - Starting iteration 182. [2026-03-25 17:01:17,318][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:01:17,319][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:01:20,522][__main__][INFO] - Number of regex retries in iteration 182: 0 [2026-03-25 17:01:20,523][__main__][INFO] - agents played in iteration 182 are Alice, Bob [2026-03-25 17:01:21,070][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:01:21,737][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:01:22,030][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:01:22,352][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:01:22,674][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:01:22,996][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:01:23,318][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:01:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:01:23,962][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:01:24,283][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:01:24,605][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:01:24,927][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:01:25,249][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:01:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:01:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:01:26,217][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:01:26,538][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:01:26,860][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:01:27,182][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:01:27,504][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:01:27,826][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:01:28,149][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:01:28,471][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:01:28,793][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:01:29,115][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:01:29,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:01:29,757][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:01:30,079][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:01:30,401][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:01:30,721][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:01:31,042][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:01:31,364][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:01:31,685][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:01:32,007][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:01:32,328][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:01:32,649][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:01:32,970][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:01:33,292][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:01:33,614][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:01:33,936][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:01:34,259][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:01:34,581][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:01:34,902][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:01:35,224][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:01:35,545][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:01:35,867][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:01:36,189][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:01:36,511][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:01:36,833][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:01:37,154][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:01:37,477][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:01:37,798][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:01:38,119][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:01:38,741][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:01:39,062][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:01:39,384][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:01:39,706][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:01:40,028][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:01:40,349][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:01:40,671][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:01:40,993][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:01:41,316][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:01:41,638][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:01:41,960][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:01:42,283][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:01:42,604][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:01:43,265][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:01:44,016][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:01:44,018][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:01:44,020][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:01:44,672][__main__][INFO] - Iteration 183 took 27s (11.71% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 9m 14s. Estimated total time: 7h 35m 54s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 35s, 500 more iterations: 3h 47m 57s. [2026-03-25 17:01:44,674][__main__][INFO] - Starting iteration 183. [2026-03-25 17:01:44,677][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:01:44,677][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:01:47,858][__main__][INFO] - Number of regex retries in iteration 183: 0 [2026-03-25 17:01:47,858][__main__][INFO] - agents played in iteration 183 are Alice, Bob [2026-03-25 17:01:48,394][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:01:49,064][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:01:49,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:01:49,678][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:01:49,999][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:01:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:01:50,642][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:01:50,963][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:01:51,284][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:01:51,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:01:51,928][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:01:52,250][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:01:52,571][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:01:52,893][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:01:53,215][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:01:53,538][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:01:53,859][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:01:54,181][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:01:54,503][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:01:54,824][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:01:55,147][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:01:55,468][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:01:55,790][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:01:56,112][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:01:56,434][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:01:56,756][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:01:57,079][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:01:57,401][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:01:57,723][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:01:58,047][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:01:58,368][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:01:58,690][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:01:59,012][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:01:59,334][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:01:59,657][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:01:59,978][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:02:00,299][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:02:00,620][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:02:00,942][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:02:01,264][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:02:01,586][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:02:01,908][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:02:02,229][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:02:02,551][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:02:02,872][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:02:03,194][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:02:03,516][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:02:03,837][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:02:04,158][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:02:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:02:04,801][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:02:05,123][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:02:05,446][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:02:06,069][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:02:06,390][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:02:06,712][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:02:07,034][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:02:07,356][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:02:07,678][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:02:08,000][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:02:08,322][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:02:08,643][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:02:08,964][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:02:09,286][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:02:09,607][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:02:09,929][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:02:10,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:02:11,499][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:02:11,502][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:02:11,503][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:02:12,122][__main__][INFO] - Iteration 184 took 27s (11.59% Gen, 86.15% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 10m 18s. Estimated total time: 7h 37m 26s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 43s. [2026-03-25 17:02:12,124][__main__][INFO] - Starting iteration 184. [2026-03-25 17:02:12,128][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:02:12,128][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:02:15,396][__main__][INFO] - Number of regex retries in iteration 184: 0 [2026-03-25 17:02:15,397][__main__][INFO] - agents played in iteration 184 are Alice, Bob [2026-03-25 17:02:15,947][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:02:16,632][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:02:16,923][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:02:17,247][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:02:17,570][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:02:17,892][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:02:18,215][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:02:18,537][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:02:18,859][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:02:19,182][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:02:19,504][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:02:19,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:02:20,149][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:02:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:02:20,794][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:02:21,116][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:02:21,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:02:21,760][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:02:22,081][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:02:22,404][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:02:22,727][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:02:23,050][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:02:23,371][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:02:23,693][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:02:24,014][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:02:24,335][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:02:24,655][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:02:24,977][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:02:25,299][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:02:25,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:02:25,942][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:02:26,264][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:02:26,585][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:02:26,907][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:02:27,228][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:02:27,551][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:02:27,872][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:02:28,194][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:02:28,515][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:02:28,836][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:02:29,159][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:02:29,481][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:02:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:02:30,123][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:02:30,446][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:02:30,767][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:02:31,088][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:02:31,410][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:02:31,731][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:02:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:02:32,373][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:02:32,695][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:02:33,016][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:02:33,639][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:02:33,961][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:02:34,283][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:02:34,605][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:02:34,926][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:02:35,247][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:02:35,569][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:02:35,890][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:02:36,212][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:02:36,534][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:02:36,855][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:02:37,177][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:02:37,499][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:02:38,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:02:38,911][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:02:38,913][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:02:38,915][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:02:39,568][__main__][INFO] - Iteration 185 took 27s (11.91% Gen, 85.70% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 9m 46s. Estimated total time: 7h 37m 21s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 44s, 500 more iterations: 3h 48m 40s. [2026-03-25 17:02:39,570][__main__][INFO] - Starting iteration 185. [2026-03-25 17:02:39,574][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:02:39,574][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:02:42,764][__main__][INFO] - Number of regex retries in iteration 185: 0 [2026-03-25 17:02:42,764][__main__][INFO] - agents played in iteration 185 are Alice, Bob [2026-03-25 17:02:43,306][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:02:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:02:44,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:02:44,594][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:02:44,916][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:02:45,237][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:02:45,560][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:02:45,881][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:02:46,203][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:02:46,525][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:02:46,847][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:02:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:02:47,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:02:47,810][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:02:48,130][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:02:48,451][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:02:48,774][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:02:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:02:49,417][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:02:49,738][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:02:50,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:02:50,381][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:02:50,703][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:02:51,026][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:02:51,347][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:02:51,668][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:02:51,989][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:02:52,310][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:02:52,631][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:02:52,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:02:53,274][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:02:53,596][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:02:53,918][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:02:54,240][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:02:54,561][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:02:54,883][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:02:55,203][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:02:55,525][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:02:55,846][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:02:56,168][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:02:56,490][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:02:56,812][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:02:57,133][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:02:57,455][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:02:57,777][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:02:58,099][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:02:58,421][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:02:58,743][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:02:59,065][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:02:59,387][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:02:59,709][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:03:00,030][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:03:00,352][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:03:00,976][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:03:01,297][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:03:01,619][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:03:01,940][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:03:02,262][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:03:02,583][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:03:02,906][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:03:03,226][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:03:03,548][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:03:03,869][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:03:04,191][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:03:04,511][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:03:04,832][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:03:05,501][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:03:06,236][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:03:06,239][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:03:06,241][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:03:06,892][__main__][INFO] - Iteration 186 took 27s (11.68% Gen, 85.93% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 7m 17s. Estimated total time: 7h 35m 19s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 39s. [2026-03-25 17:03:06,895][__main__][INFO] - Starting iteration 186. [2026-03-25 17:03:06,899][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:03:06,899][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:03:07,349][mllm.models.large_language_model_local][WARNING] - Response >B did not match regex: (|), retry 1/1 [2026-03-25 17:03:15,572][__main__][INFO] - Number of regex retries in iteration 186: 1 [2026-03-25 17:03:15,572][__main__][INFO] - agents played in iteration 186 are Alice, Bob [2026-03-25 17:03:16,134][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:03:16,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:03:17,088][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:03:17,408][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:03:17,727][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:03:18,045][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:03:18,365][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:03:18,683][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:03:19,002][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:03:19,322][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:03:19,640][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:03:19,958][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:03:20,277][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:03:20,595][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:03:20,914][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:03:21,233][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:03:21,551][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:03:21,869][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:03:22,188][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:03:22,506][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:03:22,825][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:03:23,143][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:03:23,461][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:03:23,780][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:03:24,099][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:03:24,418][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:03:24,736][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:03:25,056][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:03:25,375][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:03:25,694][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:03:26,012][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:03:26,331][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:03:26,649][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:03:26,969][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:03:27,289][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:03:27,609][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:03:27,929][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:03:28,248][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:03:28,567][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:03:28,887][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:03:29,207][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:03:29,526][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:03:29,844][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:03:30,165][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:03:30,486][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:03:30,807][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:03:31,126][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:03:31,446][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:03:31,766][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:03:32,085][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:03:32,404][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:03:32,724][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:03:33,045][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:03:33,724][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:03:34,046][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:03:34,366][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:03:34,685][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:03:35,004][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:03:35,324][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:03:35,643][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:03:35,963][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:03:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:03:36,601][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:03:36,922][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:03:37,241][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:03:37,560][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:03:38,235][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:03:38,995][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:03:38,997][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:03:38,999][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:03:39,655][__main__][INFO] - Iteration 187 took 32s (26.48% Gen, 71.51% Train). Generation: 8s, Training: 23s. Estimated remaining time: 7h 37m 22s. Estimated total time: 9h 5m 57s. Time estimates for 10 more iterations: 5m 27s, 100 more iterations: 54m 35s, 500 more iterations: 4h 32m 58s. [2026-03-25 17:03:39,658][__main__][INFO] - Starting iteration 187. [2026-03-25 17:03:39,661][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:03:39,661][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:03:42,932][__main__][INFO] - Number of regex retries in iteration 187: 0 [2026-03-25 17:03:42,933][__main__][INFO] - agents played in iteration 187 are Alice, Bob [2026-03-25 17:03:43,552][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:03:44,216][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:03:44,506][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:03:44,825][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:03:45,144][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:03:45,464][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:03:45,783][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:03:46,101][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:03:46,420][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:03:46,739][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:03:47,059][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:03:47,378][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:03:47,696][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:03:48,015][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:03:48,334][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:03:48,652][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:03:48,971][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:03:49,290][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:03:49,608][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:03:49,927][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:03:50,246][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:03:50,564][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:03:50,883][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:03:51,202][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:03:51,521][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:03:51,841][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:03:52,160][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:03:52,479][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:03:52,798][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:03:53,118][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:03:53,437][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:03:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:03:54,075][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:03:54,394][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:03:54,713][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:03:55,032][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:03:55,351][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:03:55,669][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:03:55,988][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:03:56,306][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:03:56,625][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:03:56,945][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:03:57,265][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:03:57,584][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:03:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:03:58,221][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:03:58,540][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:03:58,859][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:03:59,178][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:03:59,497][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:03:59,816][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:04:00,135][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:04:00,455][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:04:01,097][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:04:01,417][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:04:01,735][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:04:02,055][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:04:02,374][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:04:02,692][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:04:03,011][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:04:03,329][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:04:03,648][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:04:03,966][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:04:04,283][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:04:04,602][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:04:04,920][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:04:05,588][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:04:06,318][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:04:06,320][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:04:06,321][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:04:06,967][__main__][INFO] - Iteration 188 took 27s (11.98% Gen, 85.65% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 6m 5s. Estimated total time: 7h 35m 7s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 30s, 500 more iterations: 3h 47m 33s. [2026-03-25 17:04:06,969][__main__][INFO] - Starting iteration 188. [2026-03-25 17:04:06,972][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:04:06,973][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:04:10,241][__main__][INFO] - Number of regex retries in iteration 188: 0 [2026-03-25 17:04:10,241][__main__][INFO] - agents played in iteration 188 are Alice, Bob [2026-03-25 17:04:10,838][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:04:11,500][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:04:11,789][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:04:12,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:04:12,427][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:04:12,746][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:04:13,064][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:04:13,382][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:04:13,701][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:04:14,019][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:04:14,339][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:04:14,658][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:04:14,977][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:04:15,296][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:04:15,614][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:04:15,933][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:04:16,253][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:04:16,571][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:04:16,889][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:04:17,208][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:04:17,526][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:04:17,845][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:04:18,163][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:04:18,482][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:04:18,801][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:04:19,120][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:04:19,438][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:04:19,758][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:04:20,077][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:04:20,396][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:04:20,714][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:04:21,033][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:04:21,350][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:04:21,669][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:04:21,988][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:04:22,307][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:04:22,626][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:04:22,944][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:04:23,262][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:04:23,581][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:04:23,901][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:04:24,220][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:04:24,538][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:04:24,858][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:04:25,177][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:04:25,497][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:04:25,816][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:04:26,135][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:04:26,454][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:04:26,773][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:04:27,091][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:04:27,411][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:04:27,729][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:04:28,346][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:04:28,665][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:04:28,983][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:04:29,301][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:04:29,619][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:04:29,938][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:04:30,257][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:04:30,576][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:04:30,895][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:04:31,214][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:04:31,533][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:04:31,851][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:04:32,170][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:04:32,835][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:04:33,579][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:04:33,581][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:04:33,583][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:04:34,227][__main__][INFO] - Iteration 189 took 27s (11.99% Gen, 85.64% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 4m 46s. Estimated total time: 7h 34m 15s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 25s, 500 more iterations: 3h 47m 7s. [2026-03-25 17:04:34,229][__main__][INFO] - Starting iteration 189. [2026-03-25 17:04:34,232][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:04:34,233][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:04:37,457][__main__][INFO] - Number of regex retries in iteration 189: 0 [2026-03-25 17:04:37,458][__main__][INFO] - agents played in iteration 189 are Alice, Bob [2026-03-25 17:04:38,060][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:04:38,722][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:04:39,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:04:39,330][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:04:39,648][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:04:39,966][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:04:40,284][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:04:40,603][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:04:40,921][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:04:41,239][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:04:41,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:04:41,878][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:04:42,197][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:04:42,518][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:04:42,839][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:04:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:04:43,478][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:04:43,797][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:04:44,117][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:04:44,437][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:04:44,756][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:04:45,077][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:04:45,396][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:04:45,715][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:04:46,035][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:04:46,355][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:04:46,675][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:04:46,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:04:47,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:04:47,633][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:04:47,953][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:04:48,273][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:04:48,593][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:04:48,913][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:04:49,232][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:04:49,552][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:04:49,871][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:04:50,191][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:04:50,510][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:04:50,829][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:04:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:04:51,468][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:04:51,787][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:04:52,106][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:04:52,425][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:04:52,744][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:04:53,063][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:04:53,381][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:04:53,700][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:04:54,018][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:04:54,336][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:04:54,655][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:04:54,973][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:04:55,592][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:04:55,911][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:04:56,229][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:04:56,548][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:04:56,867][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:04:57,185][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:04:57,504][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:04:57,823][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:04:58,141][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:04:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:04:58,778][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:04:59,096][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:04:59,415][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:05:00,081][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:05:00,812][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:05:00,815][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:05:00,816][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:05:01,462][__main__][INFO] - Iteration 190 took 27s (11.84% Gen, 85.78% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 3m 53s. Estimated total time: 7h 33m 50s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 23s, 500 more iterations: 3h 46m 55s. [2026-03-25 17:05:01,464][__main__][INFO] - Starting iteration 190. [2026-03-25 17:05:01,467][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:05:01,468][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:05:04,723][__main__][INFO] - Number of regex retries in iteration 190: 0 [2026-03-25 17:05:04,723][__main__][INFO] - agents played in iteration 190 are Alice, Bob [2026-03-25 17:05:05,328][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:05:06,007][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:05:06,298][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:05:06,617][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:05:06,937][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:05:07,256][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:05:07,575][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:05:07,895][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:05:08,214][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:05:08,533][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:05:08,852][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:05:09,171][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:05:09,489][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:05:09,807][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:05:10,127][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:05:10,446][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:05:10,765][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:05:11,084][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:05:11,403][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:05:11,722][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:05:12,040][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:05:12,359][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:05:12,678][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:05:12,998][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:05:13,317][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:05:13,637][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:05:13,957][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:05:14,275][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:05:14,595][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:05:14,914][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:05:15,233][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:05:15,551][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:05:15,871][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:05:16,189][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:05:16,509][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:05:16,828][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:05:17,146][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:05:17,465][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:05:17,785][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:05:18,104][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:05:18,422][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:05:18,741][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:05:19,060][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:05:19,379][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:05:19,698][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:05:20,018][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:05:20,338][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:05:20,657][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:05:20,976][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:05:21,295][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:05:21,614][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:05:21,933][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:05:22,253][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:05:22,876][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:05:23,196][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:05:23,515][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:05:23,833][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:05:24,152][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:05:24,471][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:05:24,790][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:05:25,109][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:05:25,429][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:05:25,747][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:05:26,066][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:05:26,384][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:05:26,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:05:27,394][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:05:28,139][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:05:28,141][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:05:28,143][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:05:28,794][__main__][INFO] - Iteration 191 took 27s (11.91% Gen, 85.70% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 5m 3s. Estimated total time: 7h 35m 27s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 32s, 500 more iterations: 3h 47m 43s. [2026-03-25 17:05:28,798][__main__][INFO] - Starting iteration 191. [2026-03-25 17:05:28,801][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:05:28,802][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:05:32,055][__main__][INFO] - Number of regex retries in iteration 191: 0 [2026-03-25 17:05:32,056][__main__][INFO] - agents played in iteration 191 are Alice, Bob [2026-03-25 17:05:32,657][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:05:33,331][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:05:33,620][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:05:33,940][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:05:34,259][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:05:34,578][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:05:34,897][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:05:35,217][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:05:35,536][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:05:35,854][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:05:36,173][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:05:36,491][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:05:36,809][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:05:37,128][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:05:37,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:05:37,767][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:05:38,086][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:05:38,404][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:05:38,723][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:05:39,043][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:05:39,363][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:05:39,681][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:05:40,000][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:05:40,318][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:05:40,637][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:05:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:05:41,277][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:05:41,598][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:05:41,917][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:05:42,238][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:05:42,557][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:05:42,876][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:05:43,195][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:05:43,513][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:05:43,832][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:05:44,151][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:05:44,470][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:05:44,789][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:05:45,108][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:05:45,427][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:05:45,746][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:05:46,065][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:05:46,384][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:05:46,704][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:05:47,023][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:05:47,343][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:05:47,662][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:05:47,981][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:05:48,300][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:05:48,619][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:05:48,939][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:05:49,258][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:05:49,577][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:05:50,201][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:05:50,519][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:05:50,839][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:05:51,159][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:05:51,478][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:05:51,798][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:05:52,117][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:05:52,437][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:05:52,757][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:05:53,077][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:05:53,397][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:05:53,716][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:05:54,034][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:05:54,712][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:05:55,456][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:05:55,459][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:05:55,460][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:05:56,110][__main__][INFO] - Iteration 192 took 27s (11.91% Gen, 85.70% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 4m 18s. Estimated total time: 7h 35m 10s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 35s. [2026-03-25 17:05:56,113][__main__][INFO] - Starting iteration 192. [2026-03-25 17:05:56,116][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:05:56,116][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:05:59,420][__main__][INFO] - Number of regex retries in iteration 192: 0 [2026-03-25 17:05:59,421][__main__][INFO] - agents played in iteration 192 are Alice, Bob [2026-03-25 17:06:00,031][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:06:00,712][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:06:01,002][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:06:01,323][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:06:01,644][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:06:01,964][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:06:02,284][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:06:02,604][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:06:02,924][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:06:03,243][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:06:03,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:06:03,883][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:06:04,203][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:06:04,524][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:06:04,844][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:06:05,163][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:06:05,484][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:06:05,805][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:06:06,123][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:06:06,443][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:06:06,764][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:06:07,084][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:06:07,403][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:06:07,723][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:06:08,043][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:06:08,362][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:06:08,681][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:06:09,001][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:06:09,320][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:06:09,639][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:06:09,958][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:06:10,278][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:06:10,597][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:06:10,917][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:06:11,236][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:06:11,555][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:06:11,874][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:06:12,195][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:06:12,514][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:06:12,832][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:06:13,152][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:06:13,471][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:06:13,790][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:06:14,110][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:06:14,429][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:06:14,748][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:06:15,067][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:06:15,387][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:06:15,706][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:06:16,024][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:06:16,343][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:06:16,662][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:06:16,981][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:06:17,603][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:06:17,923][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:06:18,244][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:06:18,562][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:06:18,881][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:06:19,201][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:06:19,520][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:06:19,839][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:06:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:06:20,480][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:06:20,800][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:06:21,120][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:06:21,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:06:22,115][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:06:22,861][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:06:22,863][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:06:22,865][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:06:23,508][__main__][INFO] - Iteration 193 took 27s (12.07% Gen, 85.58% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 5m 14s. Estimated total time: 7h 36m 33s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 16s. [2026-03-25 17:06:23,510][__main__][INFO] - Starting iteration 193. [2026-03-25 17:06:23,513][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:06:23,514][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:06:26,751][__main__][INFO] - Number of regex retries in iteration 193: 0 [2026-03-25 17:06:26,752][__main__][INFO] - agents played in iteration 193 are Alice, Bob [2026-03-25 17:06:27,379][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:06:28,041][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:06:28,330][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:06:28,649][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:06:28,969][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:06:29,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:06:29,608][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:06:29,927][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:06:30,246][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:06:30,565][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:06:30,884][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:06:31,203][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:06:31,523][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:06:31,842][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:06:32,161][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:06:32,480][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:06:32,799][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:06:33,119][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:06:33,438][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:06:33,757][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:06:34,076][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:06:34,395][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:06:34,714][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:06:35,033][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:06:35,353][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:06:35,673][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:06:35,993][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:06:36,313][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:06:36,633][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:06:36,953][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:06:37,272][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:06:37,592][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:06:37,911][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:06:38,230][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:06:38,549][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:06:38,868][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:06:39,187][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:06:39,506][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:06:39,826][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:06:40,145][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:06:40,464][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:06:40,784][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:06:41,103][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:06:41,422][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:06:41,741][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:06:42,062][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:06:42,382][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:06:42,701][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:06:43,021][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:06:43,340][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:06:43,661][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:06:43,980][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:06:44,299][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:06:44,924][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:06:45,243][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:06:45,562][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:06:45,881][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:06:46,200][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:06:46,520][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:06:46,839][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:06:47,158][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:06:47,477][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:06:47,797][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:06:48,116][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:06:48,435][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:06:48,754][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:06:49,431][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:06:50,171][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:06:50,173][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:06:50,174][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:06:50,826][__main__][INFO] - Iteration 194 took 27s (11.85% Gen, 85.75% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 3m 27s. Estimated total time: 7h 35m 14s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 37s. [2026-03-25 17:06:50,829][__main__][INFO] - Starting iteration 194. [2026-03-25 17:06:50,832][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:06:50,832][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:06:54,087][__main__][INFO] - Number of regex retries in iteration 194: 0 [2026-03-25 17:06:54,088][__main__][INFO] - agents played in iteration 194 are Alice, Bob [2026-03-25 17:06:54,653][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:06:55,313][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:06:55,602][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:06:55,921][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:06:56,240][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:06:56,559][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:06:56,879][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:06:57,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:06:57,518][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:06:57,837][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:06:58,156][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:06:58,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:06:58,794][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:06:59,114][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:06:59,432][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:06:59,751][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:07:00,070][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:07:00,389][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:07:00,708][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:07:01,027][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:07:01,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:07:01,664][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:07:01,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:07:02,304][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:07:02,623][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:07:02,942][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:07:03,262][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:07:03,581][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:07:03,900][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:07:04,219][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:07:04,539][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:07:04,859][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:07:05,179][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:07:05,498][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:07:05,817][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:07:06,136][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:07:06,455][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:07:06,775][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:07:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:07:07,414][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:07:07,733][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:07:08,052][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:07:08,371][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:07:08,690][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:07:09,010][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:07:09,329][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:07:09,648][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:07:09,967][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:07:10,286][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:07:10,606][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:07:10,925][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:07:11,245][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:07:11,565][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:07:12,185][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:07:12,505][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:07:12,824][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:07:13,143][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:07:13,462][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:07:13,782][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:07:14,102][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:07:14,422][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:07:14,742][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:07:15,062][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:07:15,382][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:07:15,703][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:07:16,022][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:07:16,699][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:07:17,454][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:07:17,456][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:07:17,458][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:07:18,149][__main__][INFO] - Iteration 195 took 27s (11.92% Gen, 85.55% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 3m 4s. Estimated total time: 7h 35m 18s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 39s. [2026-03-25 17:07:18,151][__main__][INFO] - Starting iteration 195. [2026-03-25 17:07:18,154][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:07:18,155][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:07:21,402][__main__][INFO] - Number of regex retries in iteration 195: 0 [2026-03-25 17:07:21,403][__main__][INFO] - agents played in iteration 195 are Alice, Bob [2026-03-25 17:07:21,985][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:07:22,646][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:07:22,935][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:07:23,254][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:07:23,573][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:07:23,892][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:07:24,211][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:07:24,529][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:07:24,849][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:07:25,169][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:07:25,488][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:07:25,807][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:07:26,127][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:07:26,445][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:07:26,764][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:07:27,083][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:07:27,402][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:07:27,722][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:07:28,040][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:07:28,359][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:07:28,680][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:07:28,999][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:07:29,318][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:07:29,637][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:07:29,956][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:07:30,277][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:07:30,597][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:07:30,916][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:07:31,234][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:07:31,552][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:07:31,871][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:07:32,190][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:07:32,509][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:07:32,828][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:07:33,146][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:07:33,466][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:07:33,785][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:07:34,104][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:07:34,422][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:07:34,741][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:07:35,060][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:07:35,379][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:07:35,698][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:07:36,017][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:07:36,336][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:07:36,655][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:07:36,973][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:07:37,293][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:07:37,612][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:07:37,932][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:07:38,251][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:07:38,570][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:07:38,890][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:07:39,506][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:07:39,825][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:07:40,144][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:07:40,463][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:07:40,782][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:07:41,101][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:07:41,420][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:07:41,739][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:07:42,058][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:07:42,378][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:07:42,697][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:07:43,017][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:07:43,336][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:07:44,001][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:07:44,748][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:07:44,750][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:07:44,752][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:07:45,408][__main__][INFO] - Iteration 196 took 27s (11.92% Gen, 85.67% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 1m 34s. Estimated total time: 7h 34m 15s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 25s, 500 more iterations: 3h 47m 7s. [2026-03-25 17:07:45,410][__main__][INFO] - Starting iteration 196. [2026-03-25 17:07:45,413][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:07:45,414][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:07:48,684][__main__][INFO] - Number of regex retries in iteration 196: 0 [2026-03-25 17:07:48,684][__main__][INFO] - agents played in iteration 196 are Alice, Bob [2026-03-25 17:07:49,258][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:07:49,922][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:07:50,211][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:07:50,531][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:07:50,850][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:07:51,168][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:07:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:07:51,806][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:07:52,124][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:07:52,443][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:07:52,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:07:53,081][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:07:53,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:07:53,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:07:54,037][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:07:54,357][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:07:54,676][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:07:54,995][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:07:55,315][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:07:55,634][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:07:55,952][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:07:56,270][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:07:56,589][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:07:56,908][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:07:57,228][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:07:57,546][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:07:57,865][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:07:58,184][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:07:58,502][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:07:58,821][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:07:59,141][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:07:59,459][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:07:59,777][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:08:00,097][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:08:00,416][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:08:00,735][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:08:01,054][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:08:01,372][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:08:01,690][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:08:02,009][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:08:02,328][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:08:02,647][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:08:02,966][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:08:03,284][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:08:03,602][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:08:03,920][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:08:04,239][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:08:04,558][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:08:04,877][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:08:05,195][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:08:05,515][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:08:05,835][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:08:06,154][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:08:06,789][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:08:07,108][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:08:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:08:07,747][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:08:08,067][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:08:08,385][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:08:08,703][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:08:09,023][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:08:09,342][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:08:09,661][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:08:09,980][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:08:10,299][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:08:10,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:08:11,275][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:08:12,010][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:08:12,012][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:08:12,014][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:08:12,666][__main__][INFO] - Iteration 197 took 27s (12.00% Gen, 85.60% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 1m 4s. Estimated total time: 7h 34m 13s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 25s, 500 more iterations: 3h 47m 6s. [2026-03-25 17:08:12,668][__main__][INFO] - Starting iteration 197. [2026-03-25 17:08:12,671][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:08:12,671][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:08:15,902][__main__][INFO] - Number of regex retries in iteration 197: 0 [2026-03-25 17:08:15,903][__main__][INFO] - agents played in iteration 197 are Alice, Bob [2026-03-25 17:08:16,480][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:08:17,133][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:08:17,422][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:08:17,741][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:08:18,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:08:18,380][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:08:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:08:19,018][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:08:19,337][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:08:19,656][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:08:19,975][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:08:20,294][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:08:20,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:08:20,931][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:08:21,250][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:08:21,569][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:08:21,888][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:08:22,207][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:08:22,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:08:22,844][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:08:23,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:08:23,481][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:08:23,799][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:08:24,118][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:08:24,438][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:08:24,757][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:08:25,075][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:08:25,395][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:08:25,714][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:08:26,034][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:08:26,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:08:26,674][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:08:26,992][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:08:27,312][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:08:27,632][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:08:27,951][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:08:28,271][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:08:28,590][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:08:28,908][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:08:29,228][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:08:29,548][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:08:29,868][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:08:30,188][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:08:30,508][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:08:30,827][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:08:31,146][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:08:31,466][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:08:31,786][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:08:32,105][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:08:32,425][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:08:32,745][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:08:33,064][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:08:33,384][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:08:33,999][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:08:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:08:34,638][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:08:34,958][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:08:35,278][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:08:35,598][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:08:35,918][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:08:36,237][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:08:36,556][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:08:36,875][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:08:37,195][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:08:37,514][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:08:37,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:08:38,488][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:08:39,221][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:08:39,223][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:08:39,225][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:08:39,817][__main__][INFO] - Iteration 198 took 27s (11.91% Gen, 85.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 58m 51s. Estimated total time: 7h 32m 27s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 14s, 500 more iterations: 3h 46m 13s. [2026-03-25 17:08:39,819][__main__][INFO] - Starting iteration 198. [2026-03-25 17:08:39,822][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:08:39,823][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:08:43,070][__main__][INFO] - Number of regex retries in iteration 198: 0 [2026-03-25 17:08:43,071][__main__][INFO] - agents played in iteration 198 are Alice, Bob [2026-03-25 17:08:43,643][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:08:44,294][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:08:44,583][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:08:44,902][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:08:45,220][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:08:45,539][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:08:45,857][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:08:46,175][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:08:46,494][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:08:46,813][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:08:47,131][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:08:47,449][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:08:47,768][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:08:48,087][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:08:48,407][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:08:48,726][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:08:49,045][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:08:49,363][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:08:49,682][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:08:50,001][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:08:50,320][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:08:50,640][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:08:50,960][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:08:51,277][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:08:51,595][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:08:51,913][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:08:52,232][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:08:52,551][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:08:52,869][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:08:53,188][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:08:53,507][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:08:53,825][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:08:54,144][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:08:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:08:54,781][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:08:55,100][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:08:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:08:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:08:56,057][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:08:56,376][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:08:56,696][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:08:57,015][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:08:57,334][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:08:57,653][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:08:57,971][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:08:58,290][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:08:58,610][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:08:58,929][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:08:59,248][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:08:59,566][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:08:59,885][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:09:00,204][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:09:00,523][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:09:01,145][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:09:01,463][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:09:01,783][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:09:02,103][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:09:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:09:02,741][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:09:03,060][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:09:03,379][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:09:03,700][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:09:04,019][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:09:04,338][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:09:04,657][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:09:04,977][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:09:05,644][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:09:06,386][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:09:06,389][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:09:06,390][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:09:07,027][__main__][INFO] - Iteration 199 took 27s (11.94% Gen, 85.71% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 59m 23s. Estimated total time: 7h 33m 26s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 20s, 500 more iterations: 3h 46m 43s. [2026-03-25 17:09:07,030][__main__][INFO] - Starting iteration 199. [2026-03-25 17:09:07,033][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:09:07,033][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:09:10,278][__main__][INFO] - Number of regex retries in iteration 199: 0 [2026-03-25 17:09:10,279][__main__][INFO] - agents played in iteration 199 are Alice, Bob [2026-03-25 17:09:10,866][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:09:11,518][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:09:11,808][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:09:12,127][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:09:12,445][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:09:12,763][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:09:13,082][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:09:13,400][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:09:13,719][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:09:14,038][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:09:14,358][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:09:14,677][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:09:14,995][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:09:15,313][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:09:15,631][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:09:15,950][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:09:16,269][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:09:16,588][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:09:16,907][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:09:17,225][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:09:17,545][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:09:17,863][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:09:18,182][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:09:18,502][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:09:18,821][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:09:19,140][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:09:19,459][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:09:19,779][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:09:20,098][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:09:20,418][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:09:20,737][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:09:21,055][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:09:21,373][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:09:21,692][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:09:22,011][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:09:22,330][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:09:22,648][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:09:22,967][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:09:23,285][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:09:23,604][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:09:23,923][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:09:24,241][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:09:24,560][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:09:24,879][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:09:25,198][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:09:25,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:09:25,834][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:09:26,154][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:09:26,472][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:09:26,791][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:09:27,110][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:09:27,429][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:09:27,749][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:09:28,364][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:09:28,683][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:09:29,001][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:09:29,321][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:09:29,640][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:09:29,960][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:09:30,279][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:09:30,598][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:09:30,917][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:09:31,237][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:09:31,556][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:09:31,875][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:09:32,195][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:09:32,851][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:09:33,584][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:09:33,586][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:09:33,588][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:09:34,240][__main__][INFO] - Iteration 200 took 27s (11.93% Gen, 85.67% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 58m 58s. Estimated total time: 7h 33m 28s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 20s, 500 more iterations: 3h 46m 44s. [2026-03-25 17:09:34,242][__main__][INFO] - Starting iteration 200. [2026-03-25 17:09:34,245][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 3 and human policies 1. [2026-03-25 17:09:34,246][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:09:37,504][__main__][INFO] - Number of regex retries in iteration 200: 0 [2026-03-25 17:09:37,505][__main__][INFO] - agents played in iteration 200 are Alice, Bob [2026-03-25 17:09:38,093][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:09:38,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:09:39,050][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:09:39,369][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:09:39,687][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:09:40,008][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:09:40,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:09:40,650][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:09:40,969][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:09:41,288][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:09:41,607][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:09:41,927][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:09:42,245][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:09:42,565][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:09:42,885][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:09:43,204][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:09:43,522][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:09:43,841][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:09:44,160][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:09:44,480][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:09:44,800][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:09:45,120][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:09:45,439][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:09:45,758][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:09:46,077][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:09:46,398][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:09:46,718][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:09:47,038][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:09:47,357][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:09:47,677][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:09:47,997][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:09:48,317][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:09:48,636][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:09:48,957][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:09:49,276][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:09:49,594][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:09:49,913][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:09:50,233][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:09:50,552][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:09:50,871][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:09:51,190][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:09:51,509][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:09:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:09:52,147][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:09:52,466][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:09:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:09:53,102][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:09:53,420][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:09:53,738][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:09:54,057][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:09:54,377][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:09:54,695][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:09:55,014][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:09:55,647][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:09:55,966][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:09:56,284][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:09:56,603][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:09:56,921][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:09:57,241][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:09:57,560][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:09:57,878][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:09:58,198][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:09:58,517][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:09:58,835][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:09:59,154][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:09:59,472][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:10:00,135][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:10:00,871][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:10:00,873][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:10:00,875][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:10:02,143][__main__][INFO] - Iteration 201 took 27s (11.68% Gen, 83.76% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 10m 1s. Estimated total time: 7h 44m 59s. Time estimates for 10 more iterations: 4m 38s, 100 more iterations: 46m 29s, 500 more iterations: 3h 52m 29s. [2026-03-25 17:10:02,146][__main__][INFO] - Starting iteration 201. [2026-03-25 17:10:02,149][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:10:02,150][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:10:05,443][__main__][INFO] - Number of regex retries in iteration 201: 0 [2026-03-25 17:10:05,444][__main__][INFO] - agents played in iteration 201 are Alice, Bob [2026-03-25 17:10:06,025][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:10:06,685][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:10:06,976][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:10:07,295][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:10:07,614][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:10:07,933][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:10:08,252][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:10:08,570][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:10:08,888][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:10:09,206][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:10:09,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:10:09,843][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:10:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:10:10,481][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:10:10,800][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:10:11,119][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:10:11,438][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:10:11,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:10:12,078][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:10:12,397][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:10:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:10:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:10:13,353][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:10:13,671][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:10:13,988][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:10:14,306][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:10:14,626][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:10:14,944][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:10:15,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:10:15,581][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:10:15,900][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:10:16,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:10:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:10:16,857][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:10:17,176][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:10:17,494][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:10:17,814][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:10:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:10:18,451][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:10:18,769][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:10:19,087][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:10:19,406][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:10:19,723][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:10:20,042][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:10:20,360][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:10:20,679][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:10:20,999][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:10:21,318][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:10:21,637][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:10:21,955][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:10:22,273][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:10:22,591][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:10:22,908][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:10:23,544][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:10:23,863][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:10:24,182][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:10:24,501][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:10:24,819][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:10:25,139][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:10:25,458][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:10:25,777][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:10:26,095][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:10:26,414][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:10:26,732][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:10:27,051][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:10:27,371][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:10:28,038][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:10:28,772][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:10:28,774][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:10:28,776][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:10:29,429][__main__][INFO] - Iteration 202 took 27s (12.08% Gen, 85.52% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 59m 16s. Estimated total time: 7h 34m 41s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 20s. [2026-03-25 17:10:29,432][__main__][INFO] - Starting iteration 202. [2026-03-25 17:10:29,435][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:10:29,435][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:10:32,718][__main__][INFO] - Number of regex retries in iteration 202: 0 [2026-03-25 17:10:32,719][__main__][INFO] - agents played in iteration 202 are Alice, Bob [2026-03-25 17:10:33,261][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:10:33,923][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:10:34,214][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:10:34,533][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:10:34,852][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:10:35,171][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:10:35,489][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:10:35,807][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:10:36,126][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:10:36,444][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:10:36,762][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:10:37,080][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:10:37,400][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:10:37,719][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:10:38,038][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:10:38,358][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:10:38,678][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:10:38,996][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:10:39,315][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:10:39,634][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:10:39,954][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:10:40,273][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:10:40,591][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:10:40,910][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:10:41,229][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:10:41,548][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:10:41,867][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:10:42,186][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:10:42,505][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:10:42,824][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:10:43,143][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:10:43,463][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:10:43,781][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:10:44,099][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:10:44,419][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:10:44,738][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:10:45,059][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:10:45,378][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:10:45,696][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:10:46,015][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:10:46,334][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:10:46,652][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:10:46,972][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:10:47,290][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:10:47,609][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:10:47,927][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:10:48,246][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:10:48,565][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:10:48,883][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:10:49,202][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:10:49,520][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:10:49,838][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:10:50,158][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:10:50,794][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:10:51,114][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:10:51,433][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:10:51,752][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:10:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:10:52,388][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:10:52,709][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:10:53,030][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:10:53,349][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:10:53,668][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:10:53,987][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:10:54,306][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:10:54,624][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:10:55,288][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:10:56,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:10:56,039][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:10:56,041][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:10:56,697][__main__][INFO] - Iteration 203 took 27s (12.04% Gen, 85.54% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 58m 31s. Estimated total time: 7h 34m 23s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 11s. [2026-03-25 17:10:56,700][__main__][INFO] - Starting iteration 203. [2026-03-25 17:10:56,703][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:10:56,703][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:11:00,009][__main__][INFO] - Number of regex retries in iteration 203: 0 [2026-03-25 17:11:00,010][__main__][INFO] - agents played in iteration 203 are Alice, Bob [2026-03-25 17:11:00,588][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:11:01,262][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:11:01,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:11:01,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:11:02,193][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:11:02,513][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:11:02,834][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:11:03,153][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:11:03,472][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:11:03,791][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:11:04,110][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:11:04,431][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:11:04,751][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:11:05,070][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:11:05,390][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:11:05,710][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:11:06,030][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:11:06,348][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:11:06,667][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:11:06,985][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:11:07,304][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:11:07,622][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:11:07,941][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:11:08,260][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:11:08,578][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:11:08,898][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:11:09,217][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:11:09,536][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:11:09,854][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:11:10,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:11:10,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:11:10,811][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:11:11,130][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:11:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:11:11,767][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:11:12,086][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:11:12,404][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:11:12,722][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:11:13,041][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:11:13,360][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:11:13,679][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:11:13,998][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:11:14,317][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:11:14,637][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:11:14,956][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:11:15,275][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:11:15,593][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:11:15,911][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:11:16,231][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:11:16,549][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:11:16,868][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:11:17,187][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:11:17,505][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:11:18,126][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:11:18,444][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:11:18,763][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:11:19,081][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:11:19,401][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:11:19,719][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:11:20,038][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:11:20,357][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:11:20,676][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:11:20,994][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:11:21,314][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:11:21,632][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:11:21,952][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:11:22,615][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:11:23,348][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:11:23,350][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:11:23,352][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:11:24,000][__main__][INFO] - Iteration 204 took 27s (12.11% Gen, 85.51% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 58m 38s. Estimated total time: 7h 34m 58s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 29s, 500 more iterations: 3h 47m 29s. [2026-03-25 17:11:24,002][__main__][INFO] - Starting iteration 204. [2026-03-25 17:11:24,005][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:11:24,006][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:11:27,235][__main__][INFO] - Number of regex retries in iteration 204: 0 [2026-03-25 17:11:27,236][__main__][INFO] - agents played in iteration 204 are Alice, Bob [2026-03-25 17:11:27,800][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:11:28,462][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:11:28,752][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:11:29,072][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:11:29,391][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:11:29,711][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:11:30,030][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:11:30,349][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:11:30,667][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:11:30,985][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:11:31,305][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:11:31,625][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:11:31,944][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:11:32,263][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:11:32,581][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:11:32,900][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:11:33,219][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:11:33,538][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:11:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:11:34,176][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:11:34,494][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:11:34,814][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:11:35,133][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:11:35,452][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:11:35,770][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:11:36,089][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:11:36,407][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:11:36,725][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:11:37,043][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:11:37,362][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:11:37,681][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:11:38,000][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:11:38,319][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:11:38,638][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:11:38,957][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:11:39,276][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:11:39,594][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:11:39,913][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:11:40,233][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:11:40,551][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:11:40,870][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:11:41,190][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:11:41,509][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:11:41,829][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:11:42,147][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:11:42,466][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:11:42,785][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:11:43,104][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:11:43,423][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:11:43,741][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:11:44,059][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:11:44,378][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:11:44,697][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:11:45,311][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:11:45,630][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:11:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:11:46,266][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:11:46,584][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:11:46,902][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:11:47,221][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:11:47,539][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:11:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:11:48,176][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:11:48,495][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:11:48,814][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:11:49,134][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:11:49,797][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:11:50,541][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:11:50,543][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:11:50,544][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:11:51,195][__main__][INFO] - Iteration 205 took 27s (11.88% Gen, 85.72% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 56m 23s. Estimated total time: 7h 33m 10s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 35s. [2026-03-25 17:11:51,197][__main__][INFO] - Starting iteration 205. [2026-03-25 17:11:51,200][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:11:51,201][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:11:54,435][__main__][INFO] - Number of regex retries in iteration 205: 0 [2026-03-25 17:11:54,436][__main__][INFO] - agents played in iteration 205 are Alice, Bob [2026-03-25 17:11:54,972][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:11:55,630][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:11:55,919][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:11:56,240][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:11:56,559][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:11:56,879][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:11:57,199][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:11:57,518][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:11:57,838][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:11:58,157][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:11:58,475][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:11:58,794][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:11:59,113][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:11:59,431][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:11:59,749][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:12:00,068][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:12:00,386][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:12:00,705][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:12:01,024][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:12:01,342][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:12:01,661][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:12:01,980][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:12:02,299][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:12:02,619][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:12:02,937][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:12:03,255][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:12:03,575][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:12:03,893][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:12:04,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:12:04,530][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:12:04,849][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:12:05,168][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:12:05,487][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:12:05,805][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:12:06,123][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:12:06,442][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:12:06,761][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:12:07,081][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:12:07,399][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:12:07,719][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:12:08,038][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:12:08,357][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:12:08,675][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:12:08,993][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:12:09,313][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:12:09,633][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:12:09,953][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:12:10,272][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:12:10,590][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:12:10,910][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:12:11,228][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:12:11,548][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:12:11,867][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:12:12,489][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:12:12,808][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:12:13,127][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:12:13,446][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:12:13,766][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:12:14,086][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:12:14,405][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:12:14,725][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:12:15,045][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:12:15,365][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:12:15,684][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:12:16,003][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:12:16,323][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:12:16,989][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:12:17,730][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:12:17,733][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:12:17,735][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:12:18,383][__main__][INFO] - Iteration 206 took 27s (11.90% Gen, 85.71% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 55m 50s. Estimated total time: 7h 33m 4s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 32s. [2026-03-25 17:12:18,386][__main__][INFO] - Starting iteration 206. [2026-03-25 17:12:18,389][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:12:18,389][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:12:21,683][__main__][INFO] - Number of regex retries in iteration 206: 0 [2026-03-25 17:12:21,684][__main__][INFO] - agents played in iteration 206 are Alice, Bob [2026-03-25 17:12:22,260][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:12:22,918][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:12:23,207][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:12:23,526][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:12:23,846][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:12:24,165][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:12:24,484][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:12:24,803][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:12:25,122][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:12:25,440][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:12:25,759][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:12:26,078][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:12:26,397][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:12:26,716][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:12:27,035][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:12:27,355][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:12:27,673][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:12:27,991][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:12:28,310][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:12:28,628][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:12:28,947][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:12:29,266][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:12:29,586][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:12:29,905][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:12:30,224][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:12:30,542][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:12:30,861][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:12:31,180][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:12:31,499][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:12:31,818][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:12:32,137][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:12:32,456][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:12:32,774][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:12:33,093][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:12:33,413][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:12:33,732][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:12:34,050][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:12:34,368][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:12:34,686][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:12:35,004][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:12:35,323][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:12:35,645][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:12:35,964][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:12:36,283][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:12:36,602][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:12:36,921][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:12:37,240][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:12:37,559][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:12:37,877][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:12:38,196][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:12:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:12:38,834][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:12:39,152][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:12:39,767][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:12:40,086][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:12:40,405][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:12:40,724][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:12:41,042][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:12:41,361][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:12:41,680][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:12:41,999][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:12:42,318][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:12:42,636][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:12:42,955][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:12:43,274][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:12:43,594][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:12:44,256][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:12:44,993][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:12:44,996][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:12:44,998][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:12:45,648][__main__][INFO] - Iteration 207 took 27s (12.09% Gen, 85.52% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 56m 39s. Estimated total time: 7h 34m 20s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 10s. [2026-03-25 17:12:45,650][__main__][INFO] - Starting iteration 207. [2026-03-25 17:12:45,654][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:12:45,654][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:12:48,934][__main__][INFO] - Number of regex retries in iteration 207: 0 [2026-03-25 17:12:48,935][__main__][INFO] - agents played in iteration 207 are Alice, Bob [2026-03-25 17:12:49,505][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:12:50,163][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:12:50,452][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:12:50,771][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:12:51,090][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:12:51,409][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:12:51,728][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:12:52,047][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:12:52,367][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:12:52,685][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:12:53,004][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:12:53,323][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:12:53,641][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:12:53,960][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:12:54,278][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:12:54,597][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:12:54,916][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:12:55,236][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:12:55,555][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:12:55,875][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:12:56,194][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:12:56,513][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:12:56,832][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:12:57,153][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:12:57,471][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:12:57,789][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:12:58,107][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:12:58,426][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:12:58,745][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:12:59,064][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:12:59,383][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:12:59,701][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:13:00,021][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:13:00,340][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:13:00,659][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:13:00,978][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:13:01,297][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:13:01,616][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:13:01,935][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:13:02,254][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:13:02,573][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:13:02,891][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:13:03,211][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:13:03,530][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:13:03,849][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:13:04,168][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:13:04,487][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:13:04,806][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:13:05,124][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:13:05,443][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:13:05,762][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:13:06,081][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:13:06,401][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:13:07,017][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:13:07,336][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:13:07,654][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:13:07,974][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:13:08,293][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:13:08,614][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:13:08,934][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:13:09,252][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:13:09,571][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:13:09,890][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:13:10,209][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:13:10,528][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:13:10,847][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:13:11,510][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:13:12,252][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:13:12,255][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:13:12,257][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:13:12,896][__main__][INFO] - Iteration 208 took 27s (12.04% Gen, 85.60% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 55m 54s. Estimated total time: 7h 34m 3s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 24s, 500 more iterations: 3h 47m 1s. [2026-03-25 17:13:12,898][__main__][INFO] - Starting iteration 208. [2026-03-25 17:13:12,901][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:13:12,902][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:13:16,179][__main__][INFO] - Number of regex retries in iteration 208: 0 [2026-03-25 17:13:16,179][__main__][INFO] - agents played in iteration 208 are Alice, Bob [2026-03-25 17:13:16,752][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:13:17,417][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:13:17,708][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:13:18,027][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:13:18,345][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:13:18,663][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:13:18,982][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:13:19,301][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:13:19,620][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:13:19,939][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:13:20,259][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:13:20,579][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:13:20,897][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:13:21,216][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:13:21,535][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:13:21,854][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:13:22,173][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:13:22,491][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:13:22,810][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:13:23,130][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:13:23,451][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:13:23,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:13:24,090][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:13:24,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:13:24,731][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:13:25,052][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:13:25,371][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:13:25,691][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:13:26,010][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:13:26,329][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:13:26,648][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:13:26,968][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:13:27,287][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:13:27,607][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:13:27,926][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:13:28,245][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:13:28,565][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:13:28,884][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:13:29,203][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:13:29,522][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:13:29,842][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:13:30,161][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:13:30,481][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:13:30,801][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:13:31,119][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:13:31,439][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:13:31,758][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:13:32,077][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:13:32,397][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:13:32,718][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:13:33,038][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:13:33,358][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:13:33,677][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:13:34,293][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:13:34,613][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:13:34,933][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:13:35,251][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:13:35,570][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:13:35,889][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:13:36,207][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:13:36,525][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:13:36,843][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:13:37,164][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:13:37,484][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:13:37,802][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:13:38,122][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:13:38,785][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:13:39,522][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:13:39,524][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:13:39,525][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:13:40,170][__main__][INFO] - Iteration 209 took 27s (12.02% Gen, 85.61% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 55m 53s. Estimated total time: 7h 34m 29s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 14s. [2026-03-25 17:13:40,172][__main__][INFO] - Starting iteration 209. [2026-03-25 17:13:40,175][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:13:40,176][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:13:43,449][__main__][INFO] - Number of regex retries in iteration 209: 0 [2026-03-25 17:13:43,450][__main__][INFO] - agents played in iteration 209 are Alice, Bob [2026-03-25 17:13:44,023][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:13:44,682][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:13:44,971][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:13:45,291][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:13:45,609][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:13:45,927][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:13:46,246][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:13:46,565][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:13:46,884][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:13:47,202][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:13:47,521][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:13:47,840][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:13:48,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:13:48,479][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:13:48,800][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:13:49,120][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:13:49,439][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:13:49,758][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:13:50,077][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:13:50,396][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:13:50,715][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:13:51,035][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:13:51,354][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:13:51,673][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:13:51,992][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:13:52,312][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:13:52,631][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:13:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:13:53,270][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:13:53,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:13:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:13:54,228][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:13:54,546][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:13:54,865][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:13:55,184][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:13:55,504][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:13:55,823][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:13:56,142][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:13:56,461][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:13:56,780][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:13:57,099][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:13:57,419][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:13:57,738][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:13:58,058][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:13:58,377][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:13:58,695][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:13:59,013][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:13:59,332][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:13:59,651][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:13:59,970][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:14:00,290][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:14:00,610][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:14:00,930][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:14:01,544][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:14:01,863][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:14:02,182][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:14:02,501][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:14:02,820][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:14:03,141][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:14:03,461][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:14:03,781][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:14:04,100][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:14:04,419][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:14:04,738][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:14:05,057][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:14:05,375][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:14:06,053][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:14:06,801][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:14:06,803][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:14:06,805][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:14:07,456][__main__][INFO] - Iteration 210 took 27s (12.00% Gen, 85.61% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 55m 39s. Estimated total time: 7h 34m 42s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 21s. [2026-03-25 17:14:07,459][__main__][INFO] - Starting iteration 210. [2026-03-25 17:14:07,462][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:14:07,462][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:14:10,657][__main__][INFO] - Number of regex retries in iteration 210: 0 [2026-03-25 17:14:10,658][__main__][INFO] - agents played in iteration 210 are Alice, Bob [2026-03-25 17:14:11,235][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:14:11,907][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:14:12,195][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:14:12,514][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:14:12,833][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:14:13,152][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:14:13,470][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:14:13,790][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:14:14,109][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:14:14,428][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:14:14,747][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:14:15,066][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:14:15,385][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:14:15,703][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:14:16,022][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:14:16,341][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:14:16,661][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:14:16,980][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:14:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:14:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:14:17,937][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:14:18,256][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:14:18,574][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:14:18,893][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:14:19,212][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:14:19,533][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:14:19,852][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:14:20,170][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:14:20,489][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:14:20,808][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:14:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:14:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:14:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:14:22,083][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:14:22,403][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:14:22,722][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:14:23,043][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:14:23,362][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:14:23,681][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:14:24,000][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:14:24,320][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:14:24,639][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:14:24,958][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:14:25,279][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:14:25,598][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:14:25,918][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:14:26,237][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:14:26,557][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:14:26,876][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:14:27,195][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:14:27,515][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:14:27,835][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:14:28,154][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:14:28,777][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:14:29,096][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:14:29,416][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:14:29,736][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:14:30,055][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:14:30,374][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:14:30,693][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:14:31,012][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:14:31,331][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:14:31,650][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:14:31,969][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:14:32,289][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:14:32,609][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:14:33,283][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:14:34,019][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:14:34,022][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:14:34,023][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:14:34,674][__main__][INFO] - Iteration 211 took 27s (11.74% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 54m 3s. Estimated total time: 7h 33m 33s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 46s. [2026-03-25 17:14:34,676][__main__][INFO] - Starting iteration 211. [2026-03-25 17:14:34,679][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:14:34,680][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:14:37,890][__main__][INFO] - Number of regex retries in iteration 211: 0 [2026-03-25 17:14:37,891][__main__][INFO] - agents played in iteration 211 are Alice, Bob [2026-03-25 17:14:38,459][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:14:39,132][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:14:39,421][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:14:39,741][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:14:40,060][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:14:40,379][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:14:40,700][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:14:41,020][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:14:41,339][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:14:41,659][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:14:41,978][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:14:42,298][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:14:42,618][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:14:42,939][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:14:43,259][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:14:43,580][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:14:43,900][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:14:44,220][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:14:44,540][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:14:44,860][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:14:45,180][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:14:45,502][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:14:45,822][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:14:46,141][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:14:46,462][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:14:46,782][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:14:47,103][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:14:47,423][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:14:47,742][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:14:48,061][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:14:48,382][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:14:48,702][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:14:49,022][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:14:49,341][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:14:49,662][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:14:49,981][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:14:50,300][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:14:50,620][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:14:50,939][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:14:51,259][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:14:51,579][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:14:51,899][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:14:52,218][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:14:52,539][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:14:52,860][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:14:53,180][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:14:53,499][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:14:53,819][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:14:54,139][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:14:54,459][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:14:54,779][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:14:55,100][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:14:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:14:56,041][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:14:56,361][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:14:56,681][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:14:57,001][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:14:57,320][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:14:57,641][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:14:57,960][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:14:58,280][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:14:58,599][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:14:58,919][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:14:59,238][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:14:59,557][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:14:59,877][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:15:00,552][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:15:01,294][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:15:01,296][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:15:01,297][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:15:01,949][__main__][INFO] - Iteration 212 took 27s (11.77% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 54m 32s. Estimated total time: 7h 34m 30s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 27s, 500 more iterations: 3h 47m 15s. [2026-03-25 17:15:01,951][__main__][INFO] - Starting iteration 212. [2026-03-25 17:15:01,954][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:15:01,955][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:15:05,165][__main__][INFO] - Number of regex retries in iteration 212: 0 [2026-03-25 17:15:05,166][__main__][INFO] - agents played in iteration 212 are Alice, Bob [2026-03-25 17:15:05,752][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:15:06,410][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:15:06,699][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:15:07,019][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:15:07,339][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:15:07,659][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:15:07,978][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:15:08,297][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:15:08,616][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:15:08,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:15:09,254][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:15:09,574][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:15:09,893][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:15:10,212][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:15:10,532][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:15:10,852][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:15:11,170][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:15:11,489][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:15:11,809][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:15:12,128][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:15:12,447][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:15:12,765][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:15:13,084][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:15:13,404][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:15:13,723][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:15:14,044][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:15:14,364][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:15:14,683][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:15:15,002][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:15:15,321][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:15:15,640][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:15:15,959][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:15:16,280][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:15:16,599][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:15:16,919][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:15:17,239][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:15:17,559][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:15:17,878][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:15:18,198][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:15:18,518][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:15:18,837][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:15:19,156][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:15:19,477][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:15:19,796][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:15:20,115][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:15:20,434][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:15:20,753][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:15:21,072][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:15:21,390][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:15:21,708][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:15:22,027][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:15:22,347][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:15:22,666][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:15:23,281][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:15:23,600][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:15:23,921][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:15:24,240][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:15:24,560][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:15:24,879][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:15:25,199][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:15:25,518][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:15:25,837][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:15:26,155][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:15:26,474][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:15:26,793][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:15:27,112][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:15:28,595][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:22 [2026-03-25 17:15:29,327][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:15:29,329][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:15:29,331][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:15:29,971][__main__][INFO] - Iteration 213 took 28s (11.46% Gen, 86.25% Train). Generation: 3s, Training: 24s. Estimated remaining time: 6h 6m 32s. Estimated total time: 7h 46m 57s. Time estimates for 10 more iterations: 4m 40s, 100 more iterations: 46m 41s, 500 more iterations: 3h 53m 28s. [2026-03-25 17:15:29,973][__main__][INFO] - Starting iteration 213. [2026-03-25 17:15:29,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:15:29,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:15:33,148][__main__][INFO] - Number of regex retries in iteration 213: 0 [2026-03-25 17:15:33,149][__main__][INFO] - agents played in iteration 213 are Alice, Bob [2026-03-25 17:15:33,718][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:15:34,376][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:15:34,665][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:15:34,985][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:15:35,305][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:15:35,625][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:15:35,945][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:15:36,264][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:15:36,583][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:15:36,902][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:15:37,221][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:15:37,541][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:15:37,860][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:15:38,181][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:15:38,501][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:15:38,821][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:15:39,140][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:15:39,460][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:15:39,778][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:15:40,096][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:15:40,416][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:15:40,735][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:15:41,057][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:15:41,376][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:15:41,696][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:15:42,014][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:15:42,333][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:15:42,652][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:15:42,972][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:15:43,291][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:15:43,609][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:15:43,927][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:15:44,246][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:15:44,565][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:15:44,884][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:15:45,204][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:15:45,523][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:15:45,842][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:15:46,161][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:15:46,480][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:15:46,801][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:15:47,121][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:15:47,439][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:15:47,759][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:15:48,078][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:15:48,398][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:15:48,716][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:15:49,035][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:15:49,354][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:15:49,672][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:15:49,991][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:15:50,310][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:15:50,631][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:15:51,245][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:15:51,565][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:15:51,884][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:15:52,204][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:15:52,524][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:15:52,842][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:15:53,161][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:15:53,480][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:15:53,800][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:15:54,120][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:15:54,439][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:15:54,758][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:15:55,078][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:15:55,747][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:15:56,480][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:15:56,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:15:56,485][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:15:57,126][__main__][INFO] - Iteration 214 took 27s (11.68% Gen, 85.95% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 51m 38s. Estimated total time: 7h 32m 30s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 15s. [2026-03-25 17:15:57,128][__main__][INFO] - Starting iteration 214. [2026-03-25 17:15:57,131][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:15:57,131][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:16:00,349][__main__][INFO] - Number of regex retries in iteration 214: 0 [2026-03-25 17:16:00,350][__main__][INFO] - agents played in iteration 214 are Alice, Bob [2026-03-25 17:16:00,954][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:16:01,645][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:16:01,934][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:16:02,253][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:16:02,571][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:16:02,889][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:16:03,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:16:03,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:16:03,846][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:16:04,166][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:16:04,485][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:16:04,803][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:16:05,122][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:16:05,441][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:16:05,760][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:16:06,079][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:16:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:16:06,718][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:16:07,037][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:16:07,356][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:16:07,676][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:16:07,995][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:16:08,314][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:16:08,632][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:16:08,951][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:16:09,271][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:16:09,589][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:16:09,908][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:16:10,228][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:16:10,547][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:16:10,867][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:16:11,186][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:16:11,505][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:16:11,824][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:16:12,145][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:16:12,464][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:16:12,783][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:16:13,102][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:16:13,421][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:16:13,740][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:16:14,060][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:16:14,379][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:16:14,698][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:16:15,017][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:16:15,336][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:16:15,655][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:16:15,974][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:16:16,294][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:16:16,612][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:16:16,932][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:16:17,251][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:16:17,570][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:16:17,889][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:16:18,506][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:16:18,825][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:16:19,144][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:16:19,463][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:16:19,782][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:16:20,102][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:16:20,421][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:16:20,739][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:16:21,060][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:16:21,379][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:16:21,698][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:16:22,018][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:16:22,337][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:16:23,003][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:16:23,744][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:16:23,746][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:16:23,748][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:16:24,393][__main__][INFO] - Iteration 215 took 27s (11.81% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 53m 3s. Estimated total time: 7h 34m 23s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 11s. [2026-03-25 17:16:24,395][__main__][INFO] - Starting iteration 215. [2026-03-25 17:16:24,398][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:16:24,399][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:16:27,587][__main__][INFO] - Number of regex retries in iteration 215: 0 [2026-03-25 17:16:27,588][__main__][INFO] - agents played in iteration 215 are Alice, Bob [2026-03-25 17:16:28,473][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:16:29,129][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:16:29,419][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:16:29,739][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:16:30,058][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:16:30,378][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:16:30,697][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:16:31,016][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:16:31,335][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:16:31,655][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:16:31,974][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:16:32,293][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:16:32,612][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:16:32,931][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:16:33,249][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:16:33,568][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:16:33,887][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:16:34,205][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:16:34,524][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:16:34,844][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:16:35,163][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:16:35,481][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:16:35,800][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:16:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:16:36,437][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:16:36,756][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:16:37,076][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:16:37,396][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:16:37,715][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:16:38,035][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:16:38,354][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:16:38,673][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:16:38,991][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:16:39,311][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:16:39,630][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:16:39,950][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:16:40,270][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:16:40,590][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:16:40,909][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:16:41,228][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:16:41,547][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:16:41,866][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:16:42,185][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:16:42,503][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:16:42,822][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:16:43,140][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:16:43,460][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:16:43,779][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:16:44,098][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:16:44,417][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:16:44,736][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:16:45,056][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:16:45,376][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:16:45,989][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:16:46,308][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:16:46,627][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:16:46,947][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:16:47,267][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:16:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:16:47,905][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:16:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:16:48,543][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:16:48,862][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:16:49,181][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:16:49,500][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:16:49,820][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:16:50,479][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:16:51,211][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:16:51,213][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:16:51,215][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:16:51,853][__main__][INFO] - Iteration 216 took 27s (11.62% Gen, 86.05% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 55m 48s. Estimated total time: 7h 37m 36s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 45s, 500 more iterations: 3h 48m 48s. [2026-03-25 17:16:51,856][__main__][INFO] - Starting iteration 216. [2026-03-25 17:16:51,859][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:16:51,859][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:16:55,070][__main__][INFO] - Number of regex retries in iteration 216: 0 [2026-03-25 17:16:55,071][__main__][INFO] - agents played in iteration 216 are Alice, Bob [2026-03-25 17:16:55,640][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:16:56,295][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:16:56,584][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:16:56,903][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:16:57,223][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:16:57,543][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:16:57,861][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:16:58,181][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:16:58,501][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:16:58,820][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:16:59,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:16:59,458][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:16:59,777][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:17:00,098][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:17:00,417][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:17:00,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:17:01,055][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:17:01,374][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:17:01,693][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:17:02,012][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:17:02,331][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:17:02,650][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:17:02,969][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:17:03,288][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:17:03,607][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:17:03,926][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:17:04,245][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:17:04,564][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:17:04,883][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:17:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:17:05,521][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:17:05,840][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:17:06,158][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:17:06,478][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:17:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:17:07,118][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:17:07,438][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:17:07,758][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:17:08,077][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:17:08,396][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:17:08,715][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:17:09,034][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:17:09,353][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:17:09,672][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:17:09,991][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:17:10,310][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:17:10,630][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:17:10,950][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:17:11,271][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:17:11,591][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:17:11,910][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:17:12,230][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:17:12,550][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:17:13,171][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:17:13,492][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:17:13,812][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:17:14,132][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:17:14,451][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:17:14,770][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:17:15,089][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:17:15,408][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:17:15,728][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:17:16,048][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:17:16,367][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:17:16,685][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:17:17,004][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:17:17,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:17:18,392][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:17:18,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:17:18,396][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:17:19,036][__main__][INFO] - Iteration 217 took 27s (11.82% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 50m 43s. Estimated total time: 7h 32m 58s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 29s. [2026-03-25 17:17:19,038][__main__][INFO] - Starting iteration 217. [2026-03-25 17:17:19,041][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:17:19,041][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:17:20,074][mllm.models.large_language_model_local][WARNING] - Response or .> did not match regex: (|), retry 1/1 [2026-03-25 17:17:23,543][__main__][INFO] - Number of regex retries in iteration 217: 1 [2026-03-25 17:17:23,544][__main__][INFO] - agents played in iteration 217 are Alice, Bob [2026-03-25 17:17:24,126][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:17:24,789][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:17:25,079][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:17:25,399][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:17:25,718][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:17:26,036][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:17:26,354][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:17:26,673][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:17:26,991][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:17:27,310][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:17:27,628][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:17:27,945][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:17:28,263][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:17:28,582][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:17:28,900][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:17:29,218][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:17:29,537][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:17:29,857][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:17:30,175][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:17:30,494][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:17:30,812][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:17:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:17:31,449][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:17:31,769][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:17:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:17:32,405][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:17:32,724][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:17:33,043][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:17:33,362][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:17:33,681][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:17:34,000][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:17:34,319][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:17:34,639][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:17:34,958][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:17:35,278][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:17:35,597][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:17:35,916][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:17:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:17:36,552][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:17:36,871][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:17:37,189][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:17:37,507][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:17:37,825][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:17:38,144][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:17:38,463][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:17:38,782][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:17:39,101][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:17:39,420][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:17:39,740][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:17:40,059][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:17:40,377][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:17:40,696][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:17:41,017][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:17:41,639][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:17:41,958][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:17:42,277][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:17:42,596][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:17:42,917][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:17:43,236][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:17:43,555][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:17:43,874][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:17:44,192][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:17:44,510][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:17:44,829][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:17:45,148][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:17:45,467][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:17:46,138][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:17:46,881][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:17:46,883][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:17:46,885][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:17:47,531][__main__][INFO] - Iteration 218 took 28s (15.80% Gen, 81.92% Train). Generation: 4s, Training: 23s. Estimated remaining time: 6h 12m 8s. Estimated total time: 7h 54m 51s. Time estimates for 10 more iterations: 4m 44s, 100 more iterations: 47m 29s, 500 more iterations: 3h 57m 25s. [2026-03-25 17:17:47,533][__main__][INFO] - Starting iteration 218. [2026-03-25 17:17:47,536][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:17:47,537][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:17:50,766][__main__][INFO] - Number of regex retries in iteration 218: 0 [2026-03-25 17:17:50,767][__main__][INFO] - agents played in iteration 218 are Alice, Bob [2026-03-25 17:17:51,385][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:17:52,053][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:17:52,342][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:17:52,661][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:17:52,980][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:17:53,300][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:17:53,620][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:17:53,938][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:17:54,257][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:17:54,577][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:17:54,895][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:17:55,214][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:17:55,535][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:17:55,854][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:17:56,172][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:17:56,491][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:17:56,811][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:17:57,130][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:17:57,450][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:17:57,769][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:17:58,087][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:17:58,406][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:17:58,726][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:17:59,045][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:17:59,363][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:17:59,682][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:18:00,001][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:18:00,319][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:18:00,638][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:18:00,957][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:18:01,276][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:18:01,596][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:18:01,915][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:18:02,233][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:18:02,552][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:18:02,871][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:18:03,190][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:18:03,510][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:18:03,828][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:18:04,147][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:18:04,466][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:18:04,786][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:18:05,105][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:18:05,425][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:18:05,743][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:18:06,062][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:18:06,383][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:18:06,701][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:18:07,021][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:18:07,341][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:18:07,661][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:18:07,980][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:18:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:18:08,921][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:18:09,241][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:18:09,561][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:18:09,880][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:18:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:18:10,518][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:18:10,837][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:18:11,156][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:18:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:18:11,795][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:18:12,113][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:18:12,432][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:18:12,751][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:18:13,846][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:18:14,582][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:18:14,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:18:14,586][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:18:15,234][__main__][INFO] - Iteration 219 took 27s (11.66% Gen, 85.99% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 58m 27s. Estimated total time: 7h 41m 38s. Time estimates for 10 more iterations: 4m 36s, 100 more iterations: 46m 9s, 500 more iterations: 3h 50m 49s. [2026-03-25 17:18:15,236][__main__][INFO] - Starting iteration 219. [2026-03-25 17:18:15,238][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:18:15,239][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:18:18,449][__main__][INFO] - Number of regex retries in iteration 219: 0 [2026-03-25 17:18:18,450][__main__][INFO] - agents played in iteration 219 are Alice, Bob [2026-03-25 17:18:18,999][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:18:19,660][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:18:19,949][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:18:20,268][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:18:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:18:20,907][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:18:21,227][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:18:21,546][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:18:21,866][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:18:22,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:18:22,506][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:18:22,825][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:18:23,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:18:23,464][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:18:23,783][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:18:24,104][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:18:24,424][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:18:24,743][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:18:25,063][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:18:25,382][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:18:25,702][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:18:26,022][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:18:26,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:18:26,661][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:18:26,981][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:18:27,301][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:18:27,621][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:18:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:18:28,261][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:18:28,581][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:18:28,901][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:18:29,220][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:18:29,540][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:18:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:18:30,181][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:18:30,500][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:18:30,819][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:18:31,138][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:18:31,458][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:18:31,777][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:18:32,097][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:18:32,416][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:18:32,735][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:18:33,054][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:18:33,373][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:18:33,692][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:18:34,011][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:18:34,329][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:18:34,648][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:18:34,967][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:18:35,285][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:18:35,604][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:18:35,923][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:18:36,540][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:18:36,860][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:18:37,178][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:18:37,498][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:18:37,819][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:18:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:18:38,457][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:18:38,776][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:18:39,095][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:18:39,413][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:18:39,732][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:18:40,050][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:18:40,369][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:18:41,033][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:18:41,773][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:18:41,775][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:18:41,777][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:18:42,424][__main__][INFO] - Iteration 220 took 27s (11.81% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 49m 28s. Estimated total time: 7h 33m 6s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 33s. [2026-03-25 17:18:42,427][__main__][INFO] - Starting iteration 220. [2026-03-25 17:18:42,430][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:18:42,430][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:18:46,152][__main__][INFO] - Number of regex retries in iteration 220: 0 [2026-03-25 17:18:46,153][__main__][INFO] - agents played in iteration 220 are Alice, Bob [2026-03-25 17:18:46,699][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:18:47,370][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:18:47,659][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:18:47,979][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:18:48,298][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:18:48,618][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:18:48,936][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:18:49,255][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:18:49,573][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:18:49,892][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:18:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:18:50,530][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:18:50,849][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:18:51,169][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:18:51,488][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:18:51,807][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:18:52,127][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:18:52,447][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:18:52,766][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:18:53,085][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:18:53,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:18:53,723][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:18:54,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:18:54,362][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:18:54,680][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:18:55,000][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:18:55,319][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:18:55,638][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:18:55,957][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:18:56,277][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:18:56,596][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:18:56,915][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:18:57,235][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:18:57,555][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:18:57,875][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:18:58,193][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:18:58,512][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:18:58,831][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:18:59,150][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:18:59,469][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:18:59,788][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:19:00,109][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:19:00,428][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:19:00,748][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:19:01,068][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:19:01,386][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:19:01,705][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:19:02,024][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:19:02,343][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:19:02,663][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:19:02,982][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:19:03,301][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:19:03,620][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:19:04,242][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:19:04,562][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:19:04,881][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:19:05,200][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:19:05,520][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:19:05,839][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:19:06,159][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:19:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:19:06,798][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:19:07,117][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:19:07,436][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:19:07,756][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:19:08,076][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:19:08,751][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:19:09,488][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:19:09,490][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:19:09,492][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:19:10,143][__main__][INFO] - Iteration 221 took 27s (13.43% Gen, 84.21% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 57m 48s. Estimated total time: 7h 41m 54s. Time estimates for 10 more iterations: 4m 37s, 100 more iterations: 46m 11s, 500 more iterations: 3h 50m 57s. [2026-03-25 17:19:10,145][__main__][INFO] - Starting iteration 221. [2026-03-25 17:19:10,148][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:19:10,149][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:19:13,373][__main__][INFO] - Number of regex retries in iteration 221: 0 [2026-03-25 17:19:13,374][__main__][INFO] - agents played in iteration 221 are Alice, Bob [2026-03-25 17:19:13,958][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:19:14,629][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:19:15,003][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:19:15,322][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:19:15,641][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:19:15,960][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:19:16,279][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:19:16,598][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:19:16,917][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:19:17,236][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:19:17,556][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:19:17,875][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:19:18,194][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:19:18,514][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:19:18,833][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:19:19,152][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:19:19,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:19:19,790][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:19:20,110][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:19:20,429][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:19:20,748][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:19:21,067][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:19:21,387][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:19:21,706][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:19:22,025][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:19:22,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:19:22,663][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:19:22,982][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:19:23,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:19:23,621][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:19:23,940][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:19:24,259][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:19:24,579][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:19:24,898][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:19:25,218][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:19:25,537][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:19:25,855][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:19:26,175][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:19:26,495][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:19:26,814][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:19:27,133][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:19:27,452][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:19:27,773][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:19:28,092][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:19:28,411][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:19:28,730][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:19:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:19:29,368][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:19:29,687][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:19:30,006][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:19:30,326][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:19:30,645][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:19:30,964][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:19:31,586][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:19:31,905][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:19:32,224][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:19:32,543][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:19:32,862][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:19:33,182][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:19:33,502][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:19:33,822][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:19:34,141][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:19:34,461][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:19:34,780][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:19:35,101][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:19:35,421][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:19:36,099][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:19:36,833][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:19:36,835][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:19:36,837][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:19:37,491][__main__][INFO] - Iteration 222 took 27s (11.80% Gen, 85.81% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 51m 10s. Estimated total time: 7h 35m 43s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 34s, 500 more iterations: 3h 47m 51s. [2026-03-25 17:19:37,493][__main__][INFO] - Starting iteration 222. [2026-03-25 17:19:37,497][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:19:37,497][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:19:40,806][__main__][INFO] - Number of regex retries in iteration 222: 0 [2026-03-25 17:19:40,807][__main__][INFO] - agents played in iteration 222 are Alice, Bob [2026-03-25 17:19:41,390][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:19:42,072][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:19:42,362][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:19:42,682][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:19:43,001][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:19:43,321][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:19:43,640][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:19:43,960][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:19:44,282][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:19:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:19:44,923][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:19:45,241][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:19:45,560][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:19:45,878][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:19:46,197][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:19:46,516][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:19:46,835][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:19:47,155][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:19:47,474][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:19:47,793][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:19:48,111][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:19:48,430][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:19:48,750][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:19:49,069][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:19:49,389][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:19:49,708][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:19:50,027][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:19:50,345][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:19:50,664][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:19:50,983][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:19:51,301][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:19:51,621][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:19:51,941][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:19:52,260][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:19:52,579][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:19:52,898][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:19:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:19:53,536][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:19:53,855][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:19:54,175][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:19:54,494][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:19:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:19:55,133][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:19:55,453][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:19:55,772][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:19:56,091][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:19:56,409][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:19:56,729][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:19:57,048][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:19:57,367][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:19:57,686][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:19:58,004][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:19:58,323][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:19:58,946][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:19:59,265][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:19:59,585][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:19:59,905][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:20:00,224][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:20:00,543][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:20:00,862][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:20:01,182][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:20:01,501][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:20:01,820][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:20:02,140][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:20:02,459][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:20:02,778][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:20:03,454][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:20:04,199][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:20:04,201][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:20:04,203][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:20:04,855][__main__][INFO] - Iteration 223 took 27s (12.10% Gen, 85.51% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 50m 59s. Estimated total time: 7h 35m 59s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 35s, 500 more iterations: 3h 47m 59s. [2026-03-25 17:20:04,857][__main__][INFO] - Starting iteration 223. [2026-03-25 17:20:04,860][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:20:04,861][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:20:08,076][__main__][INFO] - Number of regex retries in iteration 223: 0 [2026-03-25 17:20:08,077][__main__][INFO] - agents played in iteration 223 are Alice, Bob [2026-03-25 17:20:08,662][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:20:09,314][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:20:09,604][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:20:09,923][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:20:10,242][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:20:10,562][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:20:10,881][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:20:11,199][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:20:11,518][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:20:11,838][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:20:12,157][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:20:12,475][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:20:12,793][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:20:13,112][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:20:13,431][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:20:13,750][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:20:14,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:20:14,387][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:20:14,706][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:20:15,025][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:20:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:20:15,664][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:20:15,985][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:20:16,305][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:20:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:20:16,944][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:20:17,264][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:20:17,583][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:20:17,903][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:20:18,222][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:20:18,543][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:20:18,863][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:20:19,181][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:20:19,501][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:20:19,821][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:20:20,140][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:20:20,459][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:20:20,780][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:20:21,099][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:20:21,418][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:20:21,737][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:20:22,056][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:20:22,376][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:20:22,694][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:20:23,013][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:20:23,333][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:20:23,652][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:20:23,971][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:20:24,289][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:20:24,608][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:20:24,927][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:20:25,246][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:20:25,566][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:20:26,179][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:20:26,498][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:20:26,818][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:20:27,138][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:20:27,457][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:20:27,776][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:20:28,095][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:20:28,414][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:20:28,733][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:20:29,052][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:20:29,372][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:20:29,691][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:20:30,011][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:20:30,666][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:20:31,405][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:20:31,407][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:20:31,409][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:20:32,058][__main__][INFO] - Iteration 224 took 27s (11.82% Gen, 85.78% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 47m 50s. Estimated total time: 7h 33m 18s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 39s. [2026-03-25 17:20:32,060][__main__][INFO] - Starting iteration 224. [2026-03-25 17:20:32,063][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:20:32,063][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:20:35,273][__main__][INFO] - Number of regex retries in iteration 224: 0 [2026-03-25 17:20:35,274][__main__][INFO] - agents played in iteration 224 are Alice, Bob [2026-03-25 17:20:35,861][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:20:36,513][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:20:36,802][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:20:37,122][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:20:37,441][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:20:37,760][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:20:38,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:20:38,399][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:20:38,720][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:20:39,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:20:39,358][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:20:39,676][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:20:39,995][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:20:40,314][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:20:40,633][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:20:40,953][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:20:41,272][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:20:41,592][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:20:41,910][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:20:42,229][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:20:42,549][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:20:42,868][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:20:43,187][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:20:43,505][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:20:43,824][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:20:44,143][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:20:44,463][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:20:44,783][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:20:45,103][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:20:45,424][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:20:45,743][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:20:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:20:46,380][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:20:46,700][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:20:47,020][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:20:47,340][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:20:47,659][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:20:47,978][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:20:48,297][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:20:48,616][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:20:48,936][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:20:49,255][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:20:49,575][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:20:49,894][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:20:50,214][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:20:50,534][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:20:50,853][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:20:51,171][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:20:51,491][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:20:51,810][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:20:52,129][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:20:52,449][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:20:52,769][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:20:53,387][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:20:53,706][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:20:54,026][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:20:54,346][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:20:54,666][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:20:54,986][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:20:55,307][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:20:55,626][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:20:55,948][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:20:56,268][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:20:56,588][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:20:56,907][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:20:57,230][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:20:57,901][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:20:58,637][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:20:58,639][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:20:58,641][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:20:59,286][__main__][INFO] - Iteration 225 took 27s (11.79% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 47m 49s. Estimated total time: 7h 33m 44s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 22s, 500 more iterations: 3h 46m 52s. [2026-03-25 17:20:59,288][__main__][INFO] - Starting iteration 225. [2026-03-25 17:20:59,291][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:20:59,292][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:21:02,517][__main__][INFO] - Number of regex retries in iteration 225: 0 [2026-03-25 17:21:02,518][__main__][INFO] - agents played in iteration 225 are Alice, Bob [2026-03-25 17:21:03,098][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:21:03,751][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:21:04,040][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:21:04,359][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:21:04,679][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:21:04,998][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:21:05,318][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:21:05,637][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:21:05,955][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:21:06,274][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:21:06,594][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:21:06,914][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:21:07,233][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:21:07,553][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:21:07,873][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:21:08,192][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:21:08,510][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:21:08,831][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:21:09,149][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:21:09,468][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:21:09,787][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:21:10,106][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:21:10,425][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:21:10,744][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:21:11,063][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:21:11,382][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:21:11,702][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:21:12,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:21:12,342][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:21:12,661][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:21:12,980][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:21:13,299][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:21:13,619][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:21:13,938][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:21:14,258][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:21:14,578][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:21:14,896][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:21:15,215][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:21:15,534][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:21:15,854][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:21:16,173][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:21:16,492][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:21:16,812][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:21:17,131][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:21:17,450][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:21:17,769][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:21:18,089][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:21:18,409][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:21:18,727][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:21:19,047][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:21:19,367][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:21:19,686][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:21:20,004][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:21:20,616][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:21:20,935][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:21:21,255][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:21:21,575][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:21:21,894][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:21:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:21:22,531][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:21:22,850][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:21:23,168][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:21:23,487][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:21:23,805][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:21:24,124][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:21:24,444][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:21:25,100][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:21:25,841][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:21:25,843][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:21:25,845][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:21:26,492][__main__][INFO] - Iteration 226 took 27s (11.86% Gen, 85.75% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 46m 59s. Estimated total time: 7h 33m 22s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 20s, 500 more iterations: 3h 46m 41s. [2026-03-25 17:21:26,495][__main__][INFO] - Starting iteration 226. [2026-03-25 17:21:26,498][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:21:26,498][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:21:29,683][__main__][INFO] - Number of regex retries in iteration 226: 0 [2026-03-25 17:21:29,684][__main__][INFO] - agents played in iteration 226 are Alice, Bob [2026-03-25 17:21:30,326][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:21:30,977][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:21:31,266][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:21:31,585][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:21:31,904][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:21:32,224][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:21:32,542][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:21:32,861][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:21:33,180][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:21:33,499][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:21:33,818][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:21:34,138][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:21:34,457][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:21:34,775][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:21:35,093][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:21:35,413][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:21:35,733][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:21:36,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:21:36,370][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:21:36,689][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:21:37,008][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:21:37,326][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:21:37,645][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:21:37,963][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:21:38,283][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:21:38,602][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:21:38,921][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:21:39,241][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:21:39,561][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:21:39,880][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:21:40,199][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:21:40,518][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:21:40,836][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:21:41,155][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:21:41,475][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:21:41,794][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:21:42,113][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:21:42,432][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:21:42,752][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:21:43,071][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:21:43,389][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:21:43,708][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:21:44,027][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:21:44,345][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:21:44,664][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:21:44,983][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:21:45,302][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:21:45,622][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:21:45,941][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:21:46,261][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:21:46,581][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:21:46,901][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:21:47,222][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:21:47,835][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:21:48,153][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:21:48,472][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:21:48,790][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:21:49,109][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:21:49,428][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:21:49,747][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:21:50,066][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:21:50,385][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:21:50,704][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:21:51,023][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:21:51,342][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:21:51,661][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:21:52,318][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:21:53,056][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:21:53,058][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:21:53,060][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:21:53,707][__main__][INFO] - Iteration 227 took 27s (11.71% Gen, 85.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 46m 40s. Estimated total time: 7h 33m 29s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 20s, 500 more iterations: 3h 46m 44s. [2026-03-25 17:21:53,709][__main__][INFO] - Starting iteration 227. [2026-03-25 17:21:53,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:21:53,712][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:21:57,019][__main__][INFO] - Number of regex retries in iteration 227: 0 [2026-03-25 17:21:57,020][__main__][INFO] - agents played in iteration 227 are Alice, Bob [2026-03-25 17:21:57,629][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:21:58,300][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:21:58,589][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:21:58,909][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:21:59,228][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:21:59,547][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:21:59,866][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:22:00,185][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:22:00,504][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:22:00,823][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:22:01,142][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:22:01,461][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:22:01,781][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:22:02,101][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:22:02,419][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:22:02,739][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:22:03,059][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:22:03,379][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:22:03,699][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:22:04,018][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:22:04,338][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:22:04,657][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:22:04,977][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:22:05,297][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:22:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:22:05,936][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:22:06,257][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:22:06,576][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:22:06,895][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:22:07,214][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:22:07,534][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:22:07,854][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:22:08,174][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:22:08,494][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:22:08,813][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:22:09,134][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:22:09,454][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:22:09,774][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:22:10,094][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:22:10,415][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:22:10,734][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:22:11,054][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:22:11,376][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:22:11,696][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:22:12,016][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:22:12,335][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:22:12,656][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:22:12,977][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:22:13,297][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:22:13,618][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:22:13,939][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:22:14,258][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:22:14,577][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:22:15,199][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:22:15,519][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:22:15,838][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:22:16,156][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:22:16,475][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:22:16,793][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:22:17,112][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:22:17,433][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:22:17,753][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:22:18,072][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:22:18,392][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:22:18,712][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:22:19,031][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:22:19,705][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:22:20,448][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:22:20,450][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:22:20,452][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:22:21,102][__main__][INFO] - Iteration 228 took 27s (12.07% Gen, 85.55% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 49m 14s. Estimated total time: 7h 36m 31s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 15s. [2026-03-25 17:22:21,105][__main__][INFO] - Starting iteration 228. [2026-03-25 17:22:21,107][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:22:21,108][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:22:24,327][__main__][INFO] - Number of regex retries in iteration 228: 0 [2026-03-25 17:22:24,327][__main__][INFO] - agents played in iteration 228 are Alice, Bob [2026-03-25 17:22:24,887][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:22:25,560][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:22:25,849][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:22:26,168][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:22:26,487][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:22:26,806][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:22:27,125][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:22:27,445][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:22:27,765][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:22:28,084][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:22:28,403][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:22:28,722][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:22:29,041][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:22:29,360][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:22:29,679][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:22:29,999][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:22:30,320][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:22:30,639][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:22:30,957][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:22:31,277][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:22:31,597][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:22:31,917][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:22:32,237][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:22:32,556][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:22:32,874][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:22:33,193][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:22:33,512][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:22:33,832][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:22:34,151][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:22:34,471][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:22:34,790][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:22:35,108][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:22:35,427][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:22:35,745][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:22:36,066][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:22:36,385][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:22:36,704][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:22:37,022][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:22:37,341][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:22:37,660][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:22:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:22:38,298][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:22:38,618][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:22:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:22:39,257][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:22:39,577][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:22:39,896][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:22:40,215][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:22:40,534][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:22:40,852][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:22:41,172][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:22:41,493][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:22:41,813][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:22:42,435][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:22:42,754][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:22:43,073][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:22:43,392][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:22:43,711][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:22:44,031][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:22:44,350][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:22:44,669][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:22:44,989][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:22:45,309][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:22:45,628][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:22:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:22:46,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:22:46,943][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:22:47,681][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:22:47,683][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:22:47,685][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:22:48,338][__main__][INFO] - Iteration 229 took 27s (11.82% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 46m 7s. Estimated total time: 7h 33m 51s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 23s, 500 more iterations: 3h 46m 55s. [2026-03-25 17:22:48,340][__main__][INFO] - Starting iteration 229. [2026-03-25 17:22:48,343][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:22:48,344][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:22:51,655][__main__][INFO] - Number of regex retries in iteration 229: 0 [2026-03-25 17:22:51,656][__main__][INFO] - agents played in iteration 229 are Alice, Bob [2026-03-25 17:22:52,244][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:22:52,915][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:22:53,291][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:22:53,611][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:22:53,929][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:22:54,250][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:22:54,569][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:22:54,888][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:22:55,207][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:22:55,527][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:22:55,847][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:22:56,166][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:22:56,485][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:22:56,805][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:22:57,125][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:22:57,445][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:22:57,763][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:22:58,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:22:58,402][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:22:58,722][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:22:59,041][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:22:59,360][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:22:59,680][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:23:00,001][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:23:00,321][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:23:00,641][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:23:00,960][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:23:01,280][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:23:01,599][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:23:01,918][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:23:02,237][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:23:02,556][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:23:02,875][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:23:03,193][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:23:03,512][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:23:03,831][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:23:04,150][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:23:04,470][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:23:04,788][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:23:05,107][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:23:05,427][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:23:05,747][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:23:06,065][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:23:06,385][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:23:06,705][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:23:07,024][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:23:07,343][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:23:07,663][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:23:07,983][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:23:08,302][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:23:08,621][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:23:08,941][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:23:09,261][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:23:09,883][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:23:10,202][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:23:10,521][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:23:10,841][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:23:11,160][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:23:11,479][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:23:11,798][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:23:12,117][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:23:12,436][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:23:12,755][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:23:13,075][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:23:13,394][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:23:13,713][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:23:14,388][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:23:15,149][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:23:15,151][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:23:15,153][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:23:15,803][__main__][INFO] - Iteration 230 took 27s (12.06% Gen, 85.56% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 49m 29s. Estimated total time: 7h 37m 40s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 46s, 500 more iterations: 3h 48m 50s. [2026-03-25 17:23:15,805][__main__][INFO] - Starting iteration 230. [2026-03-25 17:23:15,808][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:23:15,809][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:23:19,052][__main__][INFO] - Number of regex retries in iteration 230: 0 [2026-03-25 17:23:19,053][__main__][INFO] - agents played in iteration 230 are Alice, Bob [2026-03-25 17:23:19,660][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:23:20,341][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:23:20,629][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:23:20,950][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:23:21,269][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:23:21,589][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:23:21,908][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:23:22,228][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:23:22,548][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:23:22,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:23:23,188][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:23:23,507][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:23:23,826][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:23:24,146][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:23:24,466][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:23:24,786][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:23:25,107][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:23:25,427][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:23:25,747][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:23:26,067][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:23:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:23:26,705][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:23:27,025][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:23:27,346][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:23:27,666][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:23:27,985][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:23:28,304][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:23:28,623][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:23:28,942][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:23:29,261][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:23:29,580][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:23:29,900][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:23:30,220][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:23:30,540][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:23:30,859][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:23:31,179][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:23:31,497][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:23:31,817][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:23:32,136][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:23:32,455][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:23:32,773][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:23:33,091][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:23:33,410][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:23:33,729][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:23:34,048][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:23:34,367][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:23:34,686][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:23:35,005][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:23:35,324][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:23:35,643][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:23:35,962][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:23:36,281][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:23:36,601][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:23:37,224][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:23:37,543][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:23:37,862][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:23:38,183][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:23:38,502][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:23:38,822][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:23:39,142][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:23:39,462][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:23:39,781][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:23:40,100][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:23:40,419][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:23:40,737][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:23:41,057][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:23:41,732][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:23:42,481][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:23:42,484][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:23:42,485][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:23:43,146][__main__][INFO] - Iteration 231 took 27s (11.87% Gen, 85.71% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 47m 0s. Estimated total time: 7h 35m 39s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 49s. [2026-03-25 17:23:43,149][__main__][INFO] - Starting iteration 231. [2026-03-25 17:23:43,151][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:23:43,152][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:23:46,417][__main__][INFO] - Number of regex retries in iteration 231: 0 [2026-03-25 17:23:46,418][__main__][INFO] - agents played in iteration 231 are Alice, Bob [2026-03-25 17:23:46,993][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:23:47,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:23:47,955][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:23:48,274][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:23:48,593][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:23:48,911][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:23:49,230][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:23:49,549][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:23:49,868][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:23:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:23:50,507][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:23:50,826][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:23:51,145][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:23:51,464][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:23:51,783][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:23:52,102][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:23:52,422][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:23:52,741][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:23:53,061][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:23:53,380][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:23:53,699][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:23:54,019][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:23:54,338][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:23:54,657][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:23:54,975][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:23:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:23:55,613][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:23:55,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:23:56,251][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:23:56,569][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:23:56,888][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:23:57,208][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:23:57,528][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:23:57,846][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:23:58,165][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:23:58,484][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:23:58,804][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:23:59,123][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:23:59,443][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:23:59,762][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:24:00,082][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:24:00,401][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:24:00,720][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:24:01,041][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:24:01,360][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:24:01,679][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:24:01,999][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:24:02,318][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:24:02,637][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:24:02,956][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:24:03,276][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:24:03,595][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:24:03,914][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:24:04,535][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:24:04,854][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:24:05,173][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:24:05,493][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:24:05,812][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:24:06,132][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:24:06,452][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:24:06,770][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:24:07,089][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:24:07,407][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:24:07,726][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:24:08,045][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:24:08,365][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:24:09,040][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:24:09,784][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:24:09,786][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:24:09,788][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:24:10,442][__main__][INFO] - Iteration 232 took 27s (11.97% Gen, 85.63% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 45m 45s. Estimated total time: 7h 34m 51s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 29s, 500 more iterations: 3h 47m 25s. [2026-03-25 17:24:10,444][__main__][INFO] - Starting iteration 232. [2026-03-25 17:24:10,447][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:24:10,448][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:24:13,736][__main__][INFO] - Number of regex retries in iteration 232: 0 [2026-03-25 17:24:13,737][__main__][INFO] - agents played in iteration 232 are Alice, Bob [2026-03-25 17:24:14,307][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:24:14,978][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:24:15,267][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:24:15,586][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:24:15,904][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:24:16,223][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:24:16,543][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:24:16,862][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:24:17,181][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:24:17,501][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:24:17,820][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:24:18,139][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:24:18,458][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:24:18,777][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:24:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:24:19,414][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:24:19,732][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:24:20,052][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:24:20,372][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:24:20,691][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:24:21,010][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:24:21,329][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:24:21,648][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:24:21,967][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:24:22,286][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:24:22,606][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:24:22,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:24:23,246][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:24:23,565][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:24:23,884][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:24:24,202][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:24:24,522][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:24:24,841][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:24:25,159][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:24:25,478][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:24:25,797][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:24:26,116][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:24:26,434][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:24:26,754][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:24:27,073][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:24:27,393][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:24:27,712][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:24:28,031][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:24:28,350][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:24:28,671][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:24:28,990][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:24:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:24:29,629][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:24:29,948][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:24:30,267][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:24:30,587][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:24:30,905][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:24:31,225][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:24:31,839][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:24:32,159][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:24:32,479][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:24:32,799][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:24:33,119][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:24:33,439][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:24:33,761][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:24:34,081][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:24:34,401][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:24:34,721][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:24:35,040][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:24:35,359][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:24:35,679][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:24:36,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:24:37,087][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:24:37,089][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:24:37,091][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:24:37,732][__main__][INFO] - Iteration 233 took 27s (12.06% Gen, 85.59% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 45m 12s. Estimated total time: 7h 34m 45s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 22s. [2026-03-25 17:24:37,734][__main__][INFO] - Starting iteration 233. [2026-03-25 17:24:37,737][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:24:37,738][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:24:40,985][__main__][INFO] - Number of regex retries in iteration 233: 0 [2026-03-25 17:24:40,985][__main__][INFO] - agents played in iteration 233 are Alice, Bob [2026-03-25 17:24:41,592][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:24:42,248][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:24:42,538][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:24:42,857][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:24:43,176][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:24:43,495][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:24:43,813][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:24:44,132][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:24:44,451][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:24:44,770][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:24:45,088][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:24:45,407][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:24:45,726][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:24:46,044][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:24:46,363][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:24:46,681][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:24:46,999][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:24:47,319][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:24:47,638][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:24:47,957][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:24:48,276][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:24:48,595][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:24:48,915][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:24:49,234][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:24:49,552][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:24:49,872][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:24:50,192][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:24:50,511][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:24:50,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:24:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:24:51,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:24:51,787][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:24:52,105][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:24:52,424][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:24:52,743][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:24:53,062][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:24:53,381][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:24:53,700][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:24:54,018][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:24:54,337][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:24:54,656][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:24:54,975][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:24:55,294][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:24:55,614][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:24:55,933][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:24:56,252][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:24:56,570][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:24:56,889][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:24:57,209][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:24:57,528][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:24:57,847][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:24:58,166][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:24:58,484][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:24:59,099][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:24:59,417][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:24:59,737][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:25:00,055][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:25:00,374][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:25:00,692][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:25:01,011][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:25:01,330][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:25:01,648][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:25:01,968][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:25:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:25:02,605][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:25:02,924][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:25:03,585][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:25:04,323][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:25:04,325][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:25:04,327][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:25:04,968][__main__][INFO] - Iteration 234 took 27s (11.93% Gen, 85.71% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 43m 51s. Estimated total time: 7h 33m 51s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 23s, 500 more iterations: 3h 46m 55s. [2026-03-25 17:25:04,970][__main__][INFO] - Starting iteration 234. [2026-03-25 17:25:04,973][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:25:04,974][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:25:08,237][__main__][INFO] - Number of regex retries in iteration 234: 0 [2026-03-25 17:25:08,238][__main__][INFO] - agents played in iteration 234 are Alice, Bob [2026-03-25 17:25:08,818][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:25:09,474][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:25:09,763][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:25:10,082][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:25:10,401][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:25:10,720][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:25:11,038][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:25:11,357][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:25:11,676][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:25:11,994][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:25:12,313][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:25:12,631][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:25:12,950][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:25:13,269][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:25:13,587][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:25:13,905][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:25:14,224][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:25:14,543][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:25:14,861][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:25:15,181][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:25:15,500][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:25:15,819][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:25:16,138][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:25:16,457][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:25:16,775][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:25:17,094][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:25:17,413][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:25:17,731][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:25:18,049][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:25:18,367][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:25:18,687][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:25:19,005][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:25:19,324][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:25:19,643][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:25:19,962][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:25:20,281][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:25:20,601][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:25:20,919][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:25:21,238][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:25:21,556][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:25:21,875][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:25:22,194][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:25:22,512][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:25:22,831][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:25:23,150][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:25:23,470][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:25:23,790][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:25:24,110][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:25:24,428][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:25:24,746][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:25:25,065][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:25:25,384][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:25:25,703][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:25:26,318][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:25:26,637][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:25:26,955][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:25:27,274][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:25:27,593][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:25:27,912][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:25:28,231][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:25:28,550][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:25:28,869][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:25:29,188][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:25:29,506][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:25:29,826][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:25:30,145][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:25:30,806][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:25:31,543][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:25:31,545][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:25:31,547][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:25:32,185][__main__][INFO] - Iteration 235 took 27s (11.99% Gen, 85.65% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 43m 5s. Estimated total time: 7h 33m 33s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 46s. [2026-03-25 17:25:32,188][__main__][INFO] - Starting iteration 235. [2026-03-25 17:25:32,191][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:25:32,191][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:25:35,445][__main__][INFO] - Number of regex retries in iteration 235: 0 [2026-03-25 17:25:35,446][__main__][INFO] - agents played in iteration 235 are Alice, Bob [2026-03-25 17:25:36,022][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:25:36,679][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:25:36,968][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:25:37,288][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:25:37,607][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:25:37,926][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:25:38,245][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:25:38,564][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:25:38,883][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:25:39,202][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:25:39,522][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:25:39,841][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:25:40,161][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:25:40,480][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:25:40,799][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:25:41,118][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:25:41,436][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:25:41,754][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:25:42,074][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:25:42,392][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:25:42,711][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:25:43,029][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:25:43,348][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:25:43,667][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:25:43,986][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:25:44,305][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:25:44,624][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:25:44,942][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:25:45,262][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:25:45,581][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:25:45,900][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:25:46,219][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:25:46,540][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:25:46,859][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:25:47,177][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:25:47,497][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:25:47,816][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:25:48,136][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:25:48,455][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:25:48,776][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:25:49,096][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:25:49,415][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:25:49,734][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:25:50,055][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:25:50,376][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:25:50,695][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:25:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:25:51,336][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:25:51,655][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:25:51,974][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:25:52,294][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:25:52,613][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:25:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:25:53,568][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:25:53,887][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:25:54,206][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:25:54,525][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:25:54,843][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:25:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:25:55,484][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:25:55,803][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:25:56,123][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:25:56,442][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:25:56,761][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:25:57,081][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:25:57,400][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:25:58,085][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:25:58,828][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:25:58,830][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:25:58,832][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:25:59,485][__main__][INFO] - Iteration 236 took 27s (11.92% Gen, 85.68% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 44m 0s. Estimated total time: 7h 34m 55s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 29s, 500 more iterations: 3h 47m 27s. [2026-03-25 17:25:59,487][__main__][INFO] - Starting iteration 236. [2026-03-25 17:25:59,490][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:25:59,490][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:26:02,753][__main__][INFO] - Number of regex retries in iteration 236: 0 [2026-03-25 17:26:02,753][__main__][INFO] - agents played in iteration 236 are Alice, Bob [2026-03-25 17:26:03,327][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:26:04,000][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:26:04,290][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:26:04,609][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:26:04,928][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:26:05,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:26:05,566][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:26:05,886][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:26:06,204][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:26:06,523][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:26:06,842][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:26:07,162][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:26:07,481][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:26:07,800][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:26:08,119][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:26:08,438][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:26:08,757][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:26:09,076][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:26:09,394][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:26:09,714][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:26:10,033][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:26:10,352][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:26:10,672][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:26:10,992][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:26:11,310][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:26:11,629][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:26:11,949][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:26:12,268][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:26:12,588][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:26:12,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:26:13,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:26:13,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:26:13,864][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:26:14,183][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:26:14,503][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:26:14,823][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:26:15,143][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:26:15,462][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:26:15,782][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:26:16,102][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:26:16,421][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:26:16,740][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:26:17,059][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:26:17,379][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:26:17,698][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:26:18,017][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:26:18,336][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:26:18,655][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:26:18,974][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:26:19,292][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:26:19,612][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:26:19,931][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:26:20,250][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:26:20,875][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:26:21,193][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:26:21,512][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:26:21,831][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:26:22,150][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:26:22,470][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:26:22,788][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:26:23,107][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:26:23,426][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:26:23,745][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:26:24,064][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:26:24,384][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:26:24,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:26:25,380][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:26:26,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:26:26,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:26:26,135][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:26:26,790][__main__][INFO] - Iteration 237 took 27s (11.95% Gen, 85.64% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 43m 38s. Estimated total time: 7h 35m 1s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 30s, 500 more iterations: 3h 47m 30s. [2026-03-25 17:26:26,792][__main__][INFO] - Starting iteration 237. [2026-03-25 17:26:26,795][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:26:26,796][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:26:30,039][__main__][INFO] - Number of regex retries in iteration 237: 0 [2026-03-25 17:26:30,039][__main__][INFO] - agents played in iteration 237 are Alice, Bob [2026-03-25 17:26:30,621][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:26:31,291][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:26:31,580][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:26:31,900][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:26:32,219][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:26:32,538][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:26:32,857][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:26:33,177][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:26:33,496][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:26:33,815][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:26:34,134][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:26:34,452][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:26:34,771][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:26:35,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:26:35,408][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:26:35,727][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:26:36,045][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:26:36,364][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:26:36,683][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:26:37,003][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:26:37,323][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:26:37,642][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:26:37,961][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:26:38,280][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:26:38,600][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:26:38,918][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:26:39,237][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:26:39,556][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:26:39,875][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:26:40,194][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:26:40,513][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:26:40,831][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:26:41,151][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:26:41,470][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:26:41,789][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:26:42,108][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:26:42,427][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:26:42,745][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:26:43,065][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:26:43,384][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:26:43,703][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:26:44,023][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:26:44,341][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:26:44,660][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:26:44,979][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:26:45,297][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:26:45,616][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:26:45,935][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:26:46,254][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:26:46,573][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:26:46,892][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:26:47,211][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:26:47,530][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:26:48,152][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:26:48,471][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:26:48,790][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:26:49,109][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:26:49,427][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:26:49,746][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:26:50,066][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:26:50,384][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:26:50,703][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:26:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:26:51,341][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:26:51,660][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:26:51,979][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:26:52,654][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:26:53,399][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:26:53,401][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:26:53,403][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:26:54,056][__main__][INFO] - Iteration 238 took 27s (11.90% Gen, 85.70% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 42m 32s. Estimated total time: 7h 34m 22s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 11s. [2026-03-25 17:26:54,058][__main__][INFO] - Starting iteration 238. [2026-03-25 17:26:54,061][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:26:54,062][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:26:57,302][__main__][INFO] - Number of regex retries in iteration 238: 0 [2026-03-25 17:26:57,303][__main__][INFO] - agents played in iteration 238 are Alice, Bob [2026-03-25 17:26:57,880][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:26:58,552][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:26:58,841][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:26:59,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:26:59,481][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:26:59,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:27:00,119][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:27:00,438][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:27:00,758][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:27:01,079][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:27:01,399][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:27:01,719][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:27:02,038][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:27:02,357][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:27:02,677][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:27:02,996][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:27:03,315][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:27:03,635][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:27:03,955][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:27:04,274][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:27:04,594][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:27:04,914][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:27:05,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:27:05,553][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:27:05,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:27:06,193][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:27:06,513][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:27:06,833][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:27:07,153][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:27:07,473][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:27:07,792][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:27:08,112][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:27:08,431][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:27:08,751][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:27:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:27:09,392][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:27:09,712][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:27:10,030][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:27:10,349][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:27:10,667][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:27:10,986][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:27:11,305][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:27:11,624][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:27:11,944][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:27:12,264][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:27:12,583][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:27:12,902][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:27:13,222][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:27:13,541][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:27:13,859][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:27:14,178][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:27:14,497][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:27:14,815][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:27:15,437][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:27:15,756][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:27:16,075][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:27:16,394][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:27:16,713][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:27:17,031][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:27:17,350][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:27:17,670][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:27:17,990][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:27:18,309][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:27:18,628][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:27:18,947][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:27:19,267][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:27:19,941][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:27:20,686][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:27:20,689][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:27:20,690][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:27:21,344][__main__][INFO] - Iteration 239 took 27s (11.88% Gen, 85.72% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 42m 26s. Estimated total time: 7h 34m 43s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 21s. [2026-03-25 17:27:21,346][__main__][INFO] - Starting iteration 239. [2026-03-25 17:27:21,349][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:27:21,349][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:27:24,597][__main__][INFO] - Number of regex retries in iteration 239: 0 [2026-03-25 17:27:24,598][__main__][INFO] - agents played in iteration 239 are Alice, Bob [2026-03-25 17:27:25,174][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:27:25,846][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:27:26,136][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:27:26,457][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:27:26,775][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:27:27,095][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:27:27,414][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:27:27,734][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:27:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:27:28,373][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:27:28,692][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:27:29,011][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:27:29,330][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:27:29,648][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:27:29,967][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:27:30,286][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:27:30,604][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:27:30,924][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:27:31,243][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:27:31,562][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:27:31,882][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:27:32,202][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:27:32,521][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:27:32,840][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:27:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:27:33,477][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:27:33,796][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:27:34,115][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:27:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:27:34,753][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:27:35,072][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:27:35,391][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:27:35,711][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:27:36,030][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:27:36,349][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:27:36,668][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:27:36,986][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:27:37,306][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:27:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:27:37,943][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:27:38,262][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:27:38,582][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:27:38,901][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:27:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:27:39,541][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:27:39,861][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:27:40,181][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:27:40,500][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:27:40,819][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:27:41,139][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:27:41,458][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:27:41,777][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:27:42,096][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:27:42,724][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:27:43,043][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:27:43,362][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:27:43,682][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:27:44,001][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:27:44,320][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:27:44,640][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:27:44,959][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:27:45,277][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:27:45,596][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:27:45,916][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:27:46,236][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:27:46,556][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:27:47,232][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:27:47,982][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:27:47,984][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:27:47,986][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:27:48,642][__main__][INFO] - Iteration 240 took 27s (11.90% Gen, 85.69% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 42m 10s. Estimated total time: 7h 34m 54s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 29s, 500 more iterations: 3h 47m 27s. [2026-03-25 17:27:48,645][__main__][INFO] - Starting iteration 240. [2026-03-25 17:27:48,648][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:27:48,649][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:27:51,820][__main__][INFO] - Number of regex retries in iteration 240: 0 [2026-03-25 17:27:51,821][__main__][INFO] - agents played in iteration 240 are Alice, Bob [2026-03-25 17:27:52,395][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:27:53,065][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:27:53,355][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:27:53,674][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:27:53,993][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:27:54,312][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:27:54,630][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:27:54,949][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:27:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:27:55,586][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:27:55,904][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:27:56,224][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:27:56,543][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:27:56,863][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:27:57,182][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:27:57,501][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:27:57,821][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:27:58,141][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:27:58,460][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:27:58,779][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:27:59,098][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:27:59,416][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:27:59,735][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:28:00,054][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:28:00,373][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:28:00,692][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:28:01,011][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:28:01,329][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:28:01,648][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:28:01,968][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:28:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:28:02,606][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:28:02,926][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:28:03,245][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:28:03,564][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:28:03,883][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:28:04,203][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:28:04,522][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:28:04,842][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:28:05,162][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:28:05,480][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:28:05,799][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:28:06,118][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:28:06,437][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:28:06,757][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:28:07,076][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:28:07,395][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:28:07,715][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:28:08,035][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:28:08,354][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:28:08,674][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:28:08,993][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:28:09,312][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:28:09,934][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:28:10,253][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:28:10,572][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:28:10,891][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:28:11,209][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:28:11,528][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:28:11,847][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:28:12,166][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:28:12,485][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:28:12,804][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:28:13,124][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:28:13,443][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:28:13,763][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:28:14,436][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:28:15,196][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:28:15,198][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:28:15,200][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:28:15,867][__main__][INFO] - Iteration 241 took 27s (11.65% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 40m 28s. Estimated total time: 7h 33m 39s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 49s. [2026-03-25 17:28:15,869][__main__][INFO] - Starting iteration 241. [2026-03-25 17:28:15,872][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:28:15,873][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:28:19,054][__main__][INFO] - Number of regex retries in iteration 241: 0 [2026-03-25 17:28:19,055][__main__][INFO] - agents played in iteration 241 are Alice, Bob [2026-03-25 17:28:19,625][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:28:20,304][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:28:20,592][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:28:20,913][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:28:21,233][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:28:21,553][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:28:21,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:28:22,192][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:28:22,511][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:28:22,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:28:23,149][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:28:23,468][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:28:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:28:24,109][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:28:24,429][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:28:24,748][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:28:25,068][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:28:25,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:28:25,706][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:28:26,024][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:28:26,342][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:28:26,662][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:28:26,982][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:28:27,302][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:28:27,621][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:28:27,940][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:28:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:28:28,578][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:28:28,897][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:28:29,216][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:28:29,536][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:28:29,855][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:28:30,173][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:28:30,491][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:28:30,810][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:28:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:28:31,450][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:28:31,769][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:28:32,088][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:28:32,407][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:28:32,726][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:28:33,045][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:28:33,364][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:28:33,683][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:28:34,002][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:28:34,321][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:28:34,640][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:28:34,960][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:28:35,279][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:28:35,598][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:28:35,917][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:28:36,236][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:28:36,555][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:28:37,178][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:28:37,497][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:28:37,815][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:28:38,134][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:28:38,453][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:28:38,772][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:28:39,091][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:28:39,409][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:28:39,728][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:28:40,047][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:28:40,366][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:28:40,685][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:28:41,003][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:28:41,679][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:28:42,412][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:28:42,414][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:28:42,416][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:28:43,080][__main__][INFO] - Iteration 242 took 27s (11.69% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 39m 50s. Estimated total time: 7h 33m 28s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 20s, 500 more iterations: 3h 46m 44s. [2026-03-25 17:28:43,082][__main__][INFO] - Starting iteration 242. [2026-03-25 17:28:43,085][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:28:43,086][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:28:46,229][__main__][INFO] - Number of regex retries in iteration 242: 0 [2026-03-25 17:28:46,230][__main__][INFO] - agents played in iteration 242 are Alice, Bob [2026-03-25 17:28:46,783][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:28:47,434][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:28:47,722][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:28:48,043][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:28:48,361][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:28:48,679][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:28:48,997][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:28:49,316][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:28:49,634][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:28:49,952][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:28:50,271][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:28:50,590][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:28:50,908][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:28:51,227][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:28:51,547][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:28:51,865][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:28:52,183][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:28:52,502][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:28:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:28:53,141][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:28:53,460][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:28:53,780][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:28:54,098][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:28:54,416][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:28:54,733][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:28:55,052][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:28:55,370][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:28:55,688][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:28:56,006][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:28:56,325][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:28:56,643][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:28:56,961][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:28:57,281][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:28:57,599][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:28:57,918][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:28:58,236][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:28:58,555][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:28:58,874][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:28:59,193][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:28:59,512][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:28:59,830][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:29:00,148][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:29:00,467][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:29:00,785][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:29:01,104][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:29:01,422][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:29:01,742][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:29:02,061][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:29:02,379][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:29:02,697][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:29:03,016][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:29:03,334][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:29:03,652][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:29:04,265][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:29:04,585][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:29:04,904][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:29:05,223][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:29:05,543][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:29:05,863][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:29:06,182][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:29:06,502][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:29:06,821][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:29:07,140][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:29:07,459][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:29:07,778][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:29:08,097][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:29:08,753][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:29:09,497][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:29:09,499][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:29:09,501][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:29:10,146][__main__][INFO] - Iteration 243 took 27s (11.62% Gen, 85.99% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 36m 56s. Estimated total time: 7h 31m 2s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 6s, 500 more iterations: 3h 45m 31s. [2026-03-25 17:29:10,148][__main__][INFO] - Starting iteration 243. [2026-03-25 17:29:10,151][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:29:10,152][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:29:10,622][mllm.models.large_language_model_local][WARNING] - Response B did not match regex: (|), retry 1/1 [2026-03-25 17:29:13,330][__main__][INFO] - Number of regex retries in iteration 243: 1 [2026-03-25 17:29:13,331][__main__][INFO] - agents played in iteration 243 are Alice, Bob [2026-03-25 17:29:13,880][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:29:14,532][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:29:14,821][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:29:15,141][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:29:15,462][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:29:15,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:29:16,101][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:29:16,420][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:29:16,740][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:29:17,058][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:29:17,376][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:29:17,696][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:29:18,015][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:29:18,333][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:29:18,652][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:29:18,971][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:29:19,289][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:29:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:29:19,925][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:29:20,244][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:29:20,563][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:29:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:29:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:29:21,522][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:29:21,841][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:29:22,160][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:29:22,480][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:29:22,798][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:29:23,117][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:29:23,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:29:23,756][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:29:24,075][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:29:24,395][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:29:24,714][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:29:25,033][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:29:25,351][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:29:25,669][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:29:25,987][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:29:26,306][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:29:26,624][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:29:26,944][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:29:27,264][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:29:27,583][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:29:27,903][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:29:28,222][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:29:28,542][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:29:28,861][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:29:29,180][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:29:29,500][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:29:29,820][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:29:30,138][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:29:30,457][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:29:30,776][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:29:31,393][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:29:31,712][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:29:32,032][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:29:32,352][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:29:32,671][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:29:32,991][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:29:33,310][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:29:33,629][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:29:33,949][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:29:34,269][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:29:34,589][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:29:34,908][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:29:35,227][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:29:35,897][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:29:36,635][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:29:36,637][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:29:36,639][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:29:37,280][__main__][INFO] - Iteration 244 took 27s (11.72% Gen, 85.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 37m 36s. Estimated total time: 7h 32m 9s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 12s, 500 more iterations: 3h 46m 4s. [2026-03-25 17:29:37,282][__main__][INFO] - Starting iteration 244. [2026-03-25 17:29:37,285][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:29:37,286][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:29:40,470][__main__][INFO] - Number of regex retries in iteration 244: 0 [2026-03-25 17:29:40,471][__main__][INFO] - agents played in iteration 244 are Alice, Bob [2026-03-25 17:29:41,035][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:29:41,691][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:29:41,981][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:29:42,300][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:29:42,620][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:29:42,939][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:29:43,257][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:29:43,575][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:29:43,894][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:29:44,213][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:29:44,531][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:29:44,850][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:29:45,168][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:29:45,487][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:29:45,805][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:29:46,124][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:29:46,444][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:29:46,763][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:29:47,082][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:29:47,401][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:29:47,720][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:29:48,038][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:29:48,358][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:29:48,677][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:29:48,995][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:29:49,314][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:29:49,633][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:29:49,952][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:29:50,270][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:29:50,589][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:29:50,907][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:29:51,226][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:29:51,544][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:29:51,864][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:29:52,184][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:29:52,503][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:29:52,822][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:29:53,141][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:29:53,459][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:29:53,777][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:29:54,095][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:29:54,414][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:29:54,733][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:29:55,052][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:29:55,370][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:29:55,689][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:29:56,007][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:29:56,326][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:29:56,644][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:29:56,963][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:29:57,282][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:29:57,601][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:29:57,920][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:29:58,534][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:29:58,853][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:29:59,173][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:29:59,491][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:29:59,809][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:30:00,128][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:30:00,446][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:30:00,765][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:30:01,084][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:30:01,403][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:30:01,722][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:30:02,041][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:30:02,360][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:30:03,020][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:30:03,755][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:30:03,757][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:30:03,758][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:30:04,401][__main__][INFO] - Iteration 245 took 27s (11.75% Gen, 85.88% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 36m 56s. Estimated total time: 7h 31m 56s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 11s, 500 more iterations: 3h 45m 58s. [2026-03-25 17:30:04,403][__main__][INFO] - Starting iteration 245. [2026-03-25 17:30:04,406][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:30:04,406][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:30:07,608][__main__][INFO] - Number of regex retries in iteration 245: 0 [2026-03-25 17:30:07,609][__main__][INFO] - agents played in iteration 245 are Alice, Bob [2026-03-25 17:30:08,182][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:30:08,839][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:30:09,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:30:09,448][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:30:09,767][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:30:10,085][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:30:10,404][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:30:10,723][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:30:11,043][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:30:11,362][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:30:11,681][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:30:12,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:30:12,321][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:30:12,640][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:30:12,958][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:30:13,278][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:30:13,596][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:30:13,915][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:30:14,234][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:30:14,554][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:30:14,873][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:30:15,193][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:30:15,511][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:30:15,830][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:30:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:30:16,467][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:30:16,786][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:30:17,104][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:30:17,424][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:30:17,743][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:30:18,062][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:30:18,380][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:30:18,699][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:30:19,017][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:30:19,336][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:30:19,655][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:30:19,974][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:30:20,292][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:30:20,610][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:30:20,929][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:30:21,248][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:30:21,566][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:30:21,885][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:30:22,204][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:30:22,522][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:30:22,842][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:30:23,160][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:30:23,479][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:30:23,798][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:30:24,117][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:30:24,435][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:30:24,754][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:30:25,073][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:30:25,687][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:30:26,006][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:30:26,325][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:30:26,644][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:30:26,964][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:30:27,283][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:30:27,601][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:30:27,921][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:30:28,240][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:30:28,559][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:30:28,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:30:29,197][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:30:29,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:30:30,174][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:30:30,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:30:30,911][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:30:30,913][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:30:31,555][__main__][INFO] - Iteration 246 took 27s (11.79% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 37m 3s. Estimated total time: 7h 32m 30s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 15s. [2026-03-25 17:30:31,558][__main__][INFO] - Starting iteration 246. [2026-03-25 17:30:31,560][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:30:31,561][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:30:34,758][__main__][INFO] - Number of regex retries in iteration 246: 0 [2026-03-25 17:30:34,759][__main__][INFO] - agents played in iteration 246 are Alice, Bob [2026-03-25 17:30:35,340][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:30:35,997][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:30:36,287][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:30:36,607][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:30:36,927][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:30:37,245][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:30:37,565][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:30:37,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:30:38,203][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:30:38,521][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:30:38,839][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:30:39,158][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:30:39,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:30:39,795][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:30:40,113][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:30:40,432][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:30:40,751][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:30:41,069][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:30:41,389][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:30:41,708][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:30:42,027][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:30:42,346][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:30:42,665][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:30:42,984][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:30:43,302][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:30:43,621][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:30:43,940][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:30:44,260][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:30:44,579][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:30:44,898][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:30:45,217][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:30:45,538][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:30:45,857][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:30:46,176][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:30:46,495][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:30:46,814][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:30:47,133][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:30:47,453][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:30:47,772][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:30:48,091][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:30:48,410][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:30:48,729][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:30:49,048][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:30:49,367][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:30:49,687][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:30:50,006][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:30:50,326][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:30:50,645][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:30:50,965][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:30:51,284][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:30:51,603][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:30:51,923][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:30:52,242][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:30:52,870][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:30:53,190][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:30:53,509][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:30:53,828][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:30:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:30:54,464][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:30:54,782][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:30:55,101][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:30:55,419][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:30:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:30:56,057][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:30:56,375][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:30:56,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:30:57,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:30:58,094][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:30:58,096][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:30:58,098][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:30:58,748][__main__][INFO] - Iteration 247 took 27s (11.76% Gen, 85.84% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 37m 14s. Estimated total time: 7h 33m 9s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 34s. [2026-03-25 17:30:58,751][__main__][INFO] - Starting iteration 247. [2026-03-25 17:30:58,754][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:30:58,754][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:31:01,956][__main__][INFO] - Number of regex retries in iteration 247: 0 [2026-03-25 17:31:01,957][__main__][INFO] - agents played in iteration 247 are Alice, Bob [2026-03-25 17:31:02,525][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:31:03,176][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:31:03,465][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:31:03,784][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:31:04,103][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:31:04,421][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:31:04,740][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:31:05,058][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:31:05,376][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:31:05,694][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:31:06,012][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:31:06,331][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:31:06,649][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:31:06,968][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:31:07,287][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:31:07,606][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:31:07,924][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:31:08,243][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:31:08,562][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:31:08,880][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:31:09,199][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:31:09,518][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:31:09,836][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:31:10,154][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:31:10,473][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:31:10,791][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:31:11,109][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:31:11,427][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:31:11,746][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:31:12,065][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:31:12,385][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:31:12,704][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:31:13,023][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:31:13,342][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:31:13,661][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:31:13,979][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:31:14,298][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:31:14,616][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:31:14,934][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:31:15,253][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:31:15,571][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:31:15,890][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:31:16,208][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:31:16,528][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:31:16,846][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:31:17,164][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:31:17,483][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:31:17,803][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:31:18,122][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:31:18,441][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:31:18,761][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:31:19,079][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:31:19,398][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:31:20,010][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:31:20,328][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:31:20,647][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:31:20,966][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:31:21,284][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:31:21,603][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:31:21,921][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:31:22,241][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:31:22,561][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:31:22,879][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:31:23,198][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:31:23,516][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:31:23,834][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:31:24,490][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:31:25,231][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:31:25,234][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:31:25,236][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:31:25,885][__main__][INFO] - Iteration 248 took 27s (11.80% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 35m 51s. Estimated total time: 7h 32m 12s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 6s. [2026-03-25 17:31:25,888][__main__][INFO] - Starting iteration 248. [2026-03-25 17:31:25,890][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:31:25,891][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:31:29,098][__main__][INFO] - Number of regex retries in iteration 248: 0 [2026-03-25 17:31:29,099][__main__][INFO] - agents played in iteration 248 are Alice, Bob [2026-03-25 17:31:29,672][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:31:30,328][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:31:30,617][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:31:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:31:31,255][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:31:31,575][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:31:31,893][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:31:32,212][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:31:32,531][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:31:32,851][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:31:33,170][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:31:33,488][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:31:33,808][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:31:34,127][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:31:34,446][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:31:34,765][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:31:35,084][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:31:35,403][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:31:35,723][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:31:36,042][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:31:36,362][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:31:36,680][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:31:36,998][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:31:37,317][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:31:37,636][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:31:37,956][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:31:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:31:38,595][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:31:38,914][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:31:39,233][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:31:39,551][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:31:39,870][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:31:40,190][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:31:40,508][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:31:40,827][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:31:41,146][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:31:41,465][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:31:41,784][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:31:42,102][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:31:42,422][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:31:42,740][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:31:43,060][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:31:43,379][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:31:43,698][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:31:44,016][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:31:44,335][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:31:44,654][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:31:44,972][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:31:45,291][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:31:45,610][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:31:45,928][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:31:46,248][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:31:46,566][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:31:47,179][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:31:47,497][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:31:47,816][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:31:48,135][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:31:48,453][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:31:48,772][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:31:49,091][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:31:49,409][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:31:49,729][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:31:50,048][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:31:50,367][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:31:50,686][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:31:51,006][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:31:51,663][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:31:52,409][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:31:52,411][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:31:52,413][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:31:53,080][__main__][INFO] - Iteration 249 took 27s (11.80% Gen, 85.74% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 36m 22s. Estimated total time: 7h 33m 10s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 35s. [2026-03-25 17:31:53,083][__main__][INFO] - Starting iteration 249. [2026-03-25 17:31:53,086][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:31:53,086][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:31:56,289][__main__][INFO] - Number of regex retries in iteration 249: 0 [2026-03-25 17:31:56,290][__main__][INFO] - agents played in iteration 249 are Alice, Bob [2026-03-25 17:31:56,837][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:31:57,489][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:31:57,778][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:31:58,098][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:31:58,416][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:31:58,736][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:31:59,056][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:31:59,375][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:31:59,694][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:32:00,015][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:32:00,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:32:00,655][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:32:00,975][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:32:01,294][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:32:01,614][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:32:01,934][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:32:02,253][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:32:02,573][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:32:02,892][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:32:03,212][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:32:03,531][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:32:03,851][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:32:04,171][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:32:04,490][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:32:04,809][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:32:05,129][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:32:05,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:32:05,769][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:32:06,088][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:32:06,408][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:32:06,727][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:32:07,049][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:32:07,369][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:32:07,688][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:32:08,007][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:32:08,327][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:32:08,646][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:32:08,965][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:32:09,284][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:32:09,604][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:32:09,923][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:32:10,243][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:32:10,562][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:32:10,881][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:32:11,200][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:32:11,520][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:32:11,839][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:32:12,159][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:32:12,478][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:32:12,797][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:32:13,116][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:32:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:32:13,754][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:32:14,367][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:32:14,686][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:32:15,006][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:32:15,325][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:32:15,643][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:32:15,961][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:32:16,281][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:32:16,600][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:32:16,919][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:32:17,237][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:32:17,555][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:32:17,873][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:32:18,192][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:32:18,849][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:32:19,591][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:32:19,593][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:32:19,594][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:32:20,243][__main__][INFO] - Iteration 250 took 27s (11.80% Gen, 85.81% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 35m 22s. Estimated total time: 7h 32m 38s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 19s. [2026-03-25 17:32:20,246][__main__][INFO] - Starting iteration 250. [2026-03-25 17:32:20,249][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 4 and human policies 1. [2026-03-25 17:32:20,249][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:32:23,455][__main__][INFO] - Number of regex retries in iteration 250: 0 [2026-03-25 17:32:23,456][__main__][INFO] - agents played in iteration 250 are Alice, Bob [2026-03-25 17:32:24,004][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:32:24,657][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:32:24,946][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:32:25,265][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:32:25,584][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:32:25,904][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:32:26,223][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:32:26,542][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:32:26,861][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:32:27,180][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:32:27,500][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:32:27,819][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:32:28,138][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:32:28,456][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:32:28,776][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:32:29,094][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:32:29,414][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:32:29,733][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:32:30,052][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:32:30,372][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:32:30,691][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:32:31,010][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:32:31,328][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:32:31,647][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:32:31,966][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:32:32,285][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:32:32,604][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:32:32,923][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:32:33,242][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:32:33,561][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:32:33,881][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:32:34,200][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:32:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:32:34,839][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:32:35,158][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:32:35,477][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:32:35,796][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:32:36,115][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:32:36,435][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:32:36,754][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:32:37,073][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:32:37,391][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:32:37,711][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:32:38,029][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:32:38,347][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:32:38,666][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:32:38,985][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:32:39,304][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:32:39,624][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:32:39,944][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:32:40,264][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:32:40,583][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:32:40,902][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:32:41,519][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:32:41,839][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:32:42,157][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:32:42,476][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:32:42,795][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:32:43,113][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:32:43,434][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:32:43,753][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:32:44,072][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:32:44,393][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:32:44,712][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:32:45,030][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:32:45,350][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:32:46,013][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:32:46,758][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:32:46,761][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:32:46,762][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:32:50,750][__main__][INFO] - Iteration 251 took 30s (10.51% Gen, 76.41% Train). Generation: 3s, Training: 23s. Estimated remaining time: 6h 30m 36s. Estimated total time: 8h 28m 22s. Time estimates for 10 more iterations: 5m 5s, 100 more iterations: 50m 50s, 500 more iterations: 4h 14m 11s. [2026-03-25 17:32:50,752][__main__][INFO] - Starting iteration 251. [2026-03-25 17:32:50,755][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:32:50,756][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:32:53,975][__main__][INFO] - Number of regex retries in iteration 251: 0 [2026-03-25 17:32:53,975][__main__][INFO] - agents played in iteration 251 are Alice, Bob [2026-03-25 17:32:54,523][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:32:55,184][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:32:55,474][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:32:55,793][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:32:56,112][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:32:56,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:32:56,750][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:32:57,069][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:32:57,388][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:32:57,707][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:32:58,027][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:32:58,347][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:32:58,667][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:32:58,986][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:32:59,305][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:32:59,624][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:32:59,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:33:00,264][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:33:00,583][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:33:00,902][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:33:01,221][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:33:01,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:33:01,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:33:02,178][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:33:02,497][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:33:02,817][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:33:03,137][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:33:03,456][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:33:03,775][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:33:04,094][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:33:04,413][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:33:04,731][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:33:05,050][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:33:05,369][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:33:05,688][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:33:06,007][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:33:06,326][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:33:06,646][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:33:06,966][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:33:07,284][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:33:07,602][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:33:07,921][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:33:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:33:08,559][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:33:08,878][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:33:09,196][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:33:09,514][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:33:09,833][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:33:10,153][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:33:10,472][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:33:10,791][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:33:11,109][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:33:11,428][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:33:12,045][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:33:12,364][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:33:12,684][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:33:13,004][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:33:13,324][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:33:13,643][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:33:13,963][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:33:14,283][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:33:14,603][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:33:14,924][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:33:15,244][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:33:15,563][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:33:15,883][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:33:16,556][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:33:17,299][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:33:17,301][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:33:17,303][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:33:17,950][__main__][INFO] - Iteration 252 took 27s (11.84% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 35m 2s. Estimated total time: 7h 33m 15s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 37s. [2026-03-25 17:33:17,953][__main__][INFO] - Starting iteration 252. [2026-03-25 17:33:17,955][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:33:17,956][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:33:21,185][__main__][INFO] - Number of regex retries in iteration 252: 0 [2026-03-25 17:33:21,186][__main__][INFO] - agents played in iteration 252 are Alice, Bob [2026-03-25 17:33:21,754][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:33:22,416][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:33:22,704][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:33:23,024][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:33:23,343][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:33:23,662][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:33:23,980][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:33:24,298][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:33:24,616][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:33:24,935][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:33:25,254][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:33:25,572][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:33:25,891][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:33:26,209][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:33:26,527][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:33:26,846][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:33:27,166][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:33:27,484][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:33:27,804][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:33:28,123][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:33:28,443][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:33:28,762][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:33:29,081][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:33:29,401][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:33:29,719][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:33:30,037][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:33:30,357][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:33:30,675][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:33:30,995][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:33:31,314][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:33:31,634][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:33:31,953][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:33:32,272][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:33:32,590][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:33:32,909][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:33:33,228][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:33:33,547][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:33:33,867][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:33:34,187][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:33:34,506][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:33:34,824][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:33:35,144][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:33:35,464][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:33:35,783][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:33:36,103][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:33:36,422][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:33:36,741][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:33:37,060][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:33:37,380][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:33:37,699][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:33:38,018][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:33:38,337][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:33:38,656][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:33:39,273][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:33:39,591][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:33:39,911][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:33:40,229][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:33:40,548][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:33:40,867][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:33:41,186][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:33:41,505][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:33:41,823][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:33:42,143][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:33:42,462][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:33:42,781][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:33:43,100][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:33:43,767][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:33:44,516][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:33:44,519][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:33:44,520][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:33:45,170][__main__][INFO] - Iteration 253 took 27s (11.87% Gen, 85.74% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 34m 54s. Estimated total time: 7h 33m 35s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 47s. [2026-03-25 17:33:45,172][__main__][INFO] - Starting iteration 253. [2026-03-25 17:33:45,175][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:33:45,176][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:33:48,413][__main__][INFO] - Number of regex retries in iteration 253: 0 [2026-03-25 17:33:48,414][__main__][INFO] - agents played in iteration 253 are Alice, Bob [2026-03-25 17:33:48,981][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:33:49,639][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:33:49,927][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:33:50,246][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:33:50,566][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:33:50,885][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:33:51,204][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:33:51,522][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:33:51,842][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:33:52,162][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:33:52,481][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:33:52,800][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:33:53,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:33:53,439][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:33:53,757][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:33:54,076][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:33:54,395][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:33:54,715][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:33:55,033][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:33:55,352][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:33:55,670][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:33:55,988][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:33:56,306][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:33:56,625][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:33:56,944][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:33:57,264][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:33:57,584][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:33:57,903][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:33:58,223][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:33:58,543][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:33:58,862][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:33:59,181][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:33:59,499][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:33:59,818][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:34:00,136][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:34:00,455][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:34:00,775][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:34:01,094][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:34:01,413][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:34:01,732][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:34:02,051][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:34:02,369][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:34:02,688][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:34:03,007][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:34:03,326][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:34:03,645][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:34:03,964][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:34:04,284][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:34:04,603][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:34:04,922][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:34:05,240][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:34:05,558][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:34:05,877][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:34:06,491][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:34:06,810][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:34:07,128][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:34:07,447][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:34:07,767][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:34:08,086][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:34:08,405][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:34:08,723][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:34:09,042][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:34:09,361][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:34:09,680][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:34:09,999][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:34:10,317][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:34:10,979][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:34:11,721][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:34:11,723][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:34:11,725][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:34:12,376][__main__][INFO] - Iteration 254 took 27s (11.90% Gen, 85.69% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 34m 13s. Estimated total time: 7h 33m 21s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 20s, 500 more iterations: 3h 46m 40s. [2026-03-25 17:34:12,378][__main__][INFO] - Starting iteration 254. [2026-03-25 17:34:12,381][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:34:12,382][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:34:15,578][__main__][INFO] - Number of regex retries in iteration 254: 0 [2026-03-25 17:34:15,579][__main__][INFO] - agents played in iteration 254 are Alice, Bob [2026-03-25 17:34:16,143][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:34:16,801][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:34:17,089][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:34:17,409][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:34:17,727][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:34:18,046][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:34:18,365][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:34:18,684][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:34:19,004][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:34:19,323][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:34:19,642][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:34:19,961][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:34:20,279][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:34:20,599][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:34:20,917][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:34:21,235][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:34:21,555][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:34:21,874][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:34:22,194][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:34:22,513][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:34:22,831][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:34:23,151][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:34:23,470][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:34:23,789][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:34:24,108][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:34:24,428][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:34:24,746][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:34:25,065][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:34:25,383][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:34:25,703][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:34:26,023][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:34:26,342][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:34:26,663][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:34:26,982][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:34:27,302][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:34:27,622][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:34:27,941][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:34:28,261][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:34:28,580][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:34:28,900][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:34:29,221][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:34:29,541][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:34:29,861][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:34:30,181][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:34:30,501][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:34:30,821][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:34:31,141][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:34:31,462][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:34:31,783][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:34:32,103][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:34:32,422][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:34:32,742][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:34:33,062][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:34:33,679][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:34:33,998][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:34:34,316][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:34:34,636][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:34:34,955][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:34:35,274][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:34:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:34:35,912][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:34:36,231][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:34:36,550][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:34:36,869][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:34:37,188][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:34:37,506][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:34:38,168][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:34:38,909][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:34:38,911][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:34:38,913][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:34:39,557][__main__][INFO] - Iteration 255 took 27s (11.76% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 33m 21s. Estimated total time: 7h 32m 57s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 28s. [2026-03-25 17:34:39,560][__main__][INFO] - Starting iteration 255. [2026-03-25 17:34:39,563][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:34:39,563][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:34:42,764][__main__][INFO] - Number of regex retries in iteration 255: 0 [2026-03-25 17:34:42,765][__main__][INFO] - agents played in iteration 255 are Alice, Bob [2026-03-25 17:34:43,337][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:34:43,993][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:34:44,282][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:34:44,601][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:34:44,920][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:34:45,239][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:34:45,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:34:45,877][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:34:46,196][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:34:46,515][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:34:46,833][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:34:47,152][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:34:47,471][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:34:47,790][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:34:48,109][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:34:48,428][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:34:48,746][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:34:49,065][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:34:49,384][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:34:49,703][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:34:50,022][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:34:50,341][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:34:50,659][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:34:50,978][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:34:51,297][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:34:51,616][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:34:51,935][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:34:52,254][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:34:52,572][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:34:52,890][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:34:53,208][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:34:53,526][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:34:53,845][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:34:54,164][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:34:54,483][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:34:54,803][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:34:55,121][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:34:55,440][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:34:55,759][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:34:56,078][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:34:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:34:56,715][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:34:57,034][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:34:57,352][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:34:57,672][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:34:57,991][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:34:58,310][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:34:58,629][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:34:58,948][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:34:59,267][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:34:59,586][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:34:59,904][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:35:00,223][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:35:00,838][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:35:01,157][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:35:01,477][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:35:01,795][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:35:02,114][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:35:02,432][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:35:02,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:35:03,069][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:35:03,389][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:35:03,707][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:35:04,025][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:35:04,344][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:35:04,665][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:35:05,329][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:35:06,080][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:35:06,082][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:35:06,084][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:35:06,737][__main__][INFO] - Iteration 256 took 27s (11.78% Gen, 85.81% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 32m 52s. Estimated total time: 7h 32m 55s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 27s. [2026-03-25 17:35:06,739][__main__][INFO] - Starting iteration 256. [2026-03-25 17:35:06,742][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:35:06,742][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:35:09,969][__main__][INFO] - Number of regex retries in iteration 256: 0 [2026-03-25 17:35:09,969][__main__][INFO] - agents played in iteration 256 are Alice, Bob [2026-03-25 17:35:10,529][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:35:11,186][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:35:11,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:35:11,795][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:35:12,114][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:35:12,432][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:35:12,751][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:35:13,070][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:35:13,388][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:35:13,707][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:35:14,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:35:14,344][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:35:14,663][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:35:14,982][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:35:15,301][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:35:15,619][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:35:15,938][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:35:16,257][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:35:16,575][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:35:16,893][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:35:17,213][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:35:17,531][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:35:17,849][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:35:18,168][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:35:18,487][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:35:18,806][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:35:19,124][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:35:19,443][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:35:19,762][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:35:20,080][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:35:20,399][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:35:20,717][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:35:21,037][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:35:21,356][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:35:21,674][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:35:21,993][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:35:22,312][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:35:22,631][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:35:22,950][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:35:23,269][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:35:23,588][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:35:23,906][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:35:24,225][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:35:24,544][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:35:24,863][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:35:25,182][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:35:25,500][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:35:25,820][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:35:26,139][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:35:26,457][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:35:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:35:27,094][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:35:27,413][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:35:28,028][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:35:28,348][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:35:28,667][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:35:28,986][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:35:29,306][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:35:29,625][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:35:29,944][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:35:30,263][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:35:30,581][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:35:30,900][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:35:31,219][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:35:31,539][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:35:31,859][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:35:32,520][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:35:33,257][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:35:33,259][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:35:33,261][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:35:33,903][__main__][INFO] - Iteration 257 took 27s (11.88% Gen, 85.75% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 32m 12s. Estimated total time: 7h 32m 41s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 16s, 500 more iterations: 3h 46m 20s. [2026-03-25 17:35:33,905][__main__][INFO] - Starting iteration 257. [2026-03-25 17:35:33,908][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:35:33,908][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:35:37,146][__main__][INFO] - Number of regex retries in iteration 257: 0 [2026-03-25 17:35:37,147][__main__][INFO] - agents played in iteration 257 are Alice, Bob [2026-03-25 17:35:37,709][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:35:38,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:35:38,657][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:35:38,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:35:39,294][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:35:39,613][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:35:39,931][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:35:40,250][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:35:40,568][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:35:40,887][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:35:41,206][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:35:41,525][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:35:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:35:42,165][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:35:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:35:42,804][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:35:43,123][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:35:43,443][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:35:43,763][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:35:44,083][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:35:44,402][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:35:44,721][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:35:45,041][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:35:45,361][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:35:45,681][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:35:46,001][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:35:46,320][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:35:46,640][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:35:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:35:47,278][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:35:47,597][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:35:47,916][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:35:48,236][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:35:48,556][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:35:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:35:49,195][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:35:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:35:49,832][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:35:50,151][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:35:50,470][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:35:50,789][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:35:51,109][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:35:51,428][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:35:51,746][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:35:52,064][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:35:52,383][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:35:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:35:53,020][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:35:53,340][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:35:53,659][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:35:53,977][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:35:54,296][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:35:54,615][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:35:55,231][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:35:55,549][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:35:55,868][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:35:56,187][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:35:56,506][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:35:56,825][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:35:57,144][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:35:57,464][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:35:57,783][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:35:58,102][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:35:58,421][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:35:58,740][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:35:59,059][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:35:59,720][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:36:00,454][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:36:00,456][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:36:00,457][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:36:01,099][__main__][INFO] - Iteration 258 took 27s (11.91% Gen, 85.73% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 32m 15s. Estimated total time: 7h 33m 11s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 35s. [2026-03-25 17:36:01,101][__main__][INFO] - Starting iteration 258. [2026-03-25 17:36:01,104][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:36:01,104][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:36:04,304][__main__][INFO] - Number of regex retries in iteration 258: 0 [2026-03-25 17:36:04,305][__main__][INFO] - agents played in iteration 258 are Alice, Bob [2026-03-25 17:36:04,869][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:36:05,528][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:36:05,817][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:36:06,137][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:36:06,455][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:36:06,773][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:36:07,091][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:36:07,410][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:36:07,729][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:36:08,048][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:36:08,367][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:36:08,686][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:36:09,005][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:36:09,323][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:36:09,643][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:36:09,962][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:36:10,281][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:36:10,600][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:36:10,919][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:36:11,238][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:36:11,557][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:36:11,876][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:36:12,196][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:36:12,515][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:36:12,834][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:36:13,153][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:36:13,472][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:36:13,790][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:36:14,109][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:36:14,428][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:36:14,746][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:36:15,065][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:36:15,384][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:36:15,703][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:36:16,022][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:36:16,341][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:36:16,660][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:36:16,979][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:36:17,299][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:36:17,618][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:36:17,937][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:36:18,256][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:36:18,575][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:36:18,894][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:36:19,213][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:36:19,532][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:36:19,851][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:36:20,172][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:36:20,490][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:36:20,808][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:36:21,126][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:36:21,446][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:36:21,765][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:36:22,387][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:36:22,706][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:36:23,025][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:36:23,345][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:36:23,665][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:36:23,986][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:36:24,306][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:36:24,625][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:36:24,944][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:36:25,264][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:36:25,583][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:36:25,902][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:36:26,222][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:36:26,898][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:36:27,641][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:36:27,644][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:36:27,645][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:36:28,292][__main__][INFO] - Iteration 259 took 27s (11.77% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 31m 45s. Estimated total time: 7h 33m 9s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 34s. [2026-03-25 17:36:28,294][__main__][INFO] - Starting iteration 259. [2026-03-25 17:36:28,297][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:36:28,298][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:36:31,711][__main__][INFO] - Number of regex retries in iteration 259: 0 [2026-03-25 17:36:31,712][__main__][INFO] - agents played in iteration 259 are Alice, Bob [2026-03-25 17:36:32,273][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:36:32,931][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:36:33,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:36:33,539][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:36:33,857][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:36:34,176][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:36:34,495][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:36:34,814][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:36:35,133][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:36:35,453][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:36:35,773][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:36:36,092][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:36:36,411][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:36:36,730][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:36:37,050][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:36:37,369][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:36:37,688][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:36:38,008][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:36:38,327][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:36:38,647][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:36:38,966][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:36:39,285][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:36:39,605][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:36:39,924][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:36:40,243][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:36:40,563][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:36:40,882][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:36:41,200][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:36:41,519][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:36:41,839][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:36:42,157][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:36:42,477][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:36:42,797][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:36:43,116][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:36:43,436][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:36:43,755][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:36:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:36:44,394][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:36:44,712][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:36:45,032][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:36:45,352][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:36:45,671][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:36:45,989][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:36:46,307][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:36:46,626][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:36:46,946][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:36:47,265][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:36:47,584][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:36:47,902][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:36:48,222][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:36:48,541][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:36:48,861][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:36:49,181][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:36:49,798][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:36:50,117][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:36:50,436][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:36:50,755][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:36:51,074][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:36:51,393][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:36:51,713][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:36:52,032][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:36:52,351][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:36:52,670][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:36:52,990][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:36:53,309][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:36:53,629][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:36:54,292][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:36:55,036][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:36:55,038][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:36:55,040][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:36:55,694][__main__][INFO] - Iteration 260 took 27s (12.46% Gen, 85.15% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 34m 46s. Estimated total time: 7h 36m 38s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 39s, 500 more iterations: 3h 48m 19s. [2026-03-25 17:36:55,697][__main__][INFO] - Starting iteration 260. [2026-03-25 17:36:55,700][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:36:55,700][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:36:58,934][__main__][INFO] - Number of regex retries in iteration 260: 0 [2026-03-25 17:36:58,935][__main__][INFO] - agents played in iteration 260 are Alice, Bob [2026-03-25 17:36:59,503][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:37:00,186][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:37:00,474][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:37:00,794][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:37:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:37:01,434][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:37:01,753][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:37:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:37:02,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:37:02,711][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:37:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:37:03,352][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:37:03,672][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:37:03,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:37:04,312][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:37:04,631][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:37:04,951][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:37:05,270][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:37:05,589][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:37:05,908][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:37:06,227][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:37:06,546][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:37:06,866][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:37:07,185][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:37:07,505][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:37:07,824][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:37:08,145][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:37:08,464][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:37:08,785][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:37:09,104][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:37:09,424][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:37:09,744][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:37:10,063][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:37:10,382][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:37:10,702][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:37:11,022][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:37:11,342][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:37:11,661][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:37:11,980][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:37:12,299][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:37:12,619][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:37:12,939][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:37:13,260][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:37:13,579][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:37:13,899][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:37:14,217][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:37:14,536][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:37:14,855][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:37:15,175][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:37:15,494][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:37:15,813][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:37:16,132][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:37:16,452][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:37:17,073][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:37:17,393][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:37:17,712][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:37:18,032][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:37:18,352][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:37:18,671][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:37:18,990][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:37:19,309][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:37:19,627][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:37:19,946][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:37:20,266][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:37:20,587][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:37:20,906][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:37:21,580][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:37:22,371][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:37:22,373][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:37:22,374][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:37:23,032][__main__][INFO] - Iteration 261 took 27s (11.83% Gen, 85.75% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 33m 14s. Estimated total time: 7h 35m 33s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 46s. [2026-03-25 17:37:23,034][__main__][INFO] - Starting iteration 261. [2026-03-25 17:37:23,037][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:37:23,037][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:37:26,353][__main__][INFO] - Number of regex retries in iteration 261: 0 [2026-03-25 17:37:26,354][__main__][INFO] - agents played in iteration 261 are Alice, Bob [2026-03-25 17:37:26,927][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:37:27,598][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:37:27,887][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:37:28,207][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:37:28,526][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:37:28,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:37:29,164][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:37:29,483][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:37:29,802][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:37:30,121][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:37:30,440][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:37:30,758][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:37:31,077][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:37:31,396][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:37:31,715][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:37:32,035][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:37:32,354][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:37:32,673][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:37:32,992][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:37:33,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:37:33,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:37:33,950][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:37:34,270][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:37:34,588][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:37:34,908][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:37:35,227][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:37:35,547][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:37:35,866][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:37:36,185][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:37:36,504][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:37:36,823][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:37:37,143][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:37:37,462][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:37:37,781][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:37:38,099][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:37:38,418][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:37:38,736][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:37:39,056][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:37:39,376][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:37:39,695][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:37:40,013][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:37:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:37:40,653][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:37:40,971][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:37:41,291][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:37:41,611][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:37:41,930][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:37:42,249][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:37:42,569][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:37:42,888][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:37:43,206][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:37:43,524][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:37:43,843][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:37:44,465][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:37:44,784][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:37:45,104][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:37:45,423][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:37:45,742][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:37:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:37:46,380][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:37:46,699][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:37:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:37:47,337][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:37:47,656][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:37:47,975][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:37:48,293][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:37:48,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:37:49,710][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:37:49,712][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:37:49,714][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:37:50,368][__main__][INFO] - Iteration 262 took 27s (12.14% Gen, 85.46% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 32m 46s. Estimated total time: 7h 35m 32s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 46s. [2026-03-25 17:37:50,370][__main__][INFO] - Starting iteration 262. [2026-03-25 17:37:50,373][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:37:50,374][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:37:53,626][__main__][INFO] - Number of regex retries in iteration 262: 0 [2026-03-25 17:37:53,627][__main__][INFO] - agents played in iteration 262 are Alice, Bob [2026-03-25 17:37:54,175][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:37:54,833][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:37:55,121][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:37:55,441][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:37:55,761][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:37:56,079][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:37:56,397][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:37:56,715][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:37:57,034][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:37:57,353][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:37:57,671][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:37:57,990][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:37:58,309][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:37:58,628][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:37:58,946][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:37:59,265][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:37:59,584][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:37:59,903][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:38:00,222][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:38:00,541][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:38:00,861][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:38:01,180][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:38:01,498][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:38:01,817][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:38:02,136][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:38:02,455][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:38:02,774][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:38:03,092][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:38:03,410][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:38:03,729][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:38:04,049][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:38:04,368][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:38:04,686][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:38:05,004][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:38:05,323][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:38:05,643][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:38:05,962][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:38:06,281][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:38:06,600][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:38:06,918][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:38:07,237][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:38:07,555][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:38:07,874][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:38:08,193][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:38:08,512][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:38:08,830][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:38:09,149][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:38:09,467][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:38:09,787][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:38:10,106][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:38:10,426][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:38:10,745][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:38:11,064][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:38:11,686][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:38:12,007][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:38:12,327][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:38:12,647][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:38:12,966][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:38:13,286][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:38:13,606][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:38:13,926][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:38:14,246][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:38:14,567][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:38:14,887][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:38:15,206][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:38:15,526][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:38:16,187][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:38:16,934][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:38:16,936][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:38:16,938][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:38:17,587][__main__][INFO] - Iteration 263 took 27s (11.95% Gen, 85.66% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 30m 21s. Estimated total time: 7h 33m 34s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 47s. [2026-03-25 17:38:17,589][__main__][INFO] - Starting iteration 263. [2026-03-25 17:38:17,592][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:38:17,593][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:38:20,847][__main__][INFO] - Number of regex retries in iteration 263: 0 [2026-03-25 17:38:20,847][__main__][INFO] - agents played in iteration 263 are Alice, Bob [2026-03-25 17:38:21,399][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:38:22,059][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:38:22,347][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:38:22,666][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:38:22,985][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:38:23,304][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:38:23,624][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:38:23,943][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:38:24,261][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:38:24,580][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:38:24,899][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:38:25,218][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:38:25,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:38:25,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:38:26,174][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:38:26,494][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:38:26,813][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:38:27,134][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:38:27,453][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:38:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:38:28,091][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:38:28,409][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:38:28,728][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:38:29,047][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:38:29,365][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:38:29,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:38:30,005][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:38:30,324][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:38:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:38:30,964][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:38:31,283][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:38:31,602][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:38:31,920][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:38:32,239][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:38:32,558][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:38:32,877][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:38:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:38:33,514][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:38:33,833][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:38:34,152][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:38:34,471][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:38:34,790][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:38:35,109][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:38:35,428][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:38:35,747][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:38:36,065][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:38:36,384][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:38:36,704][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:38:37,022][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:38:37,342][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:38:37,661][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:38:37,979][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:38:38,298][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:38:38,914][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:38:39,232][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:38:39,551][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:38:39,870][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:38:40,189][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:38:40,509][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:38:40,829][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:38:41,148][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:38:41,467][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:38:41,785][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:38:42,104][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:38:42,424][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:38:42,743][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:38:43,406][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:38:44,150][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:38:44,152][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:38:44,154][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:38:44,801][__main__][INFO] - Iteration 264 took 27s (11.96% Gen, 85.65% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 29m 49s. Estimated total time: 7h 33m 29s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 20s, 500 more iterations: 3h 46m 44s. [2026-03-25 17:38:44,803][__main__][INFO] - Starting iteration 264. [2026-03-25 17:38:44,806][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:38:44,807][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:38:48,014][__main__][INFO] - Number of regex retries in iteration 264: 0 [2026-03-25 17:38:48,015][__main__][INFO] - agents played in iteration 264 are Alice, Bob [2026-03-25 17:38:48,576][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:38:49,235][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:38:49,524][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:38:49,843][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:38:50,161][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:38:50,479][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:38:50,798][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:38:51,117][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:38:51,435][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:38:51,752][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:38:52,070][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:38:52,389][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:38:52,707][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:38:53,027][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:38:53,346][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:38:53,664][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:38:53,983][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:38:54,302][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:38:54,621][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:38:54,940][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:38:55,258][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:38:55,577][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:38:55,896][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:38:56,215][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:38:56,533][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:38:56,852][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:38:57,173][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:38:57,492][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:38:57,810][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:38:58,130][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:38:58,449][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:38:58,767][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:38:59,086][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:38:59,404][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:38:59,723][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:39:00,042][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:39:00,360][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:39:00,680][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:39:00,999][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:39:01,317][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:39:01,637][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:39:01,956][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:39:02,274][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:39:02,592][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:39:02,911][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:39:03,230][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:39:03,549][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:39:03,867][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:39:04,186][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:39:04,504][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:39:04,823][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:39:05,142][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:39:05,460][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:39:06,082][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:39:06,401][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:39:06,719][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:39:07,037][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:39:07,356][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:39:07,676][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:39:07,995][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:39:08,313][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:39:08,631][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:39:08,949][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:39:09,267][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:39:09,585][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:39:09,905][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:39:10,569][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:39:11,311][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:39:11,313][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:39:11,315][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:39:11,964][__main__][INFO] - Iteration 265 took 27s (11.81% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 28m 31s. Estimated total time: 7h 32m 39s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 19s. [2026-03-25 17:39:11,966][__main__][INFO] - Starting iteration 265. [2026-03-25 17:39:11,969][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:39:11,970][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:39:15,223][__main__][INFO] - Number of regex retries in iteration 265: 0 [2026-03-25 17:39:15,224][__main__][INFO] - agents played in iteration 265 are Alice, Bob [2026-03-25 17:39:15,774][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:39:16,445][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:39:16,734][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:39:17,053][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:39:17,371][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:39:17,689][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:39:18,008][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:39:18,327][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:39:18,646][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:39:18,965][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:39:19,284][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:39:19,602][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:39:19,921][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:39:20,240][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:39:20,558][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:39:20,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:39:21,195][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:39:21,515][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:39:21,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:39:22,155][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:39:22,473][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:39:22,792][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:39:23,112][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:39:23,431][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:39:23,751][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:39:24,070][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:39:24,389][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:39:24,708][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:39:25,028][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:39:25,348][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:39:25,666][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:39:25,987][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:39:26,307][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:39:26,627][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:39:26,947][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:39:27,266][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:39:27,585][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:39:27,905][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:39:28,224][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:39:28,544][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:39:28,864][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:39:29,185][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:39:29,504][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:39:29,824][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:39:30,144][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:39:30,464][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:39:30,783][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:39:31,102][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:39:31,422][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:39:31,740][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:39:32,059][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:39:32,377][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:39:32,696][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:39:33,317][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:39:33,636][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:39:33,955][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:39:34,274][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:39:34,592][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:39:34,912][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:39:35,230][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:39:35,549][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:39:35,868][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:39:36,187][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:39:36,505][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:39:36,823][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:39:37,141][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:39:37,817][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:39:38,559][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:39:38,562][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:39:38,563][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:39:39,219][__main__][INFO] - Iteration 266 took 27s (11.94% Gen, 85.65% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 29m 35s. Estimated total time: 7h 34m 10s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 25s, 500 more iterations: 3h 47m 5s. [2026-03-25 17:39:39,221][__main__][INFO] - Starting iteration 266. [2026-03-25 17:39:39,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:39:39,225][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:39:42,511][__main__][INFO] - Number of regex retries in iteration 266: 0 [2026-03-25 17:39:42,511][__main__][INFO] - agents played in iteration 266 are Alice, Bob [2026-03-25 17:39:43,085][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:39:43,755][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:39:44,044][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:39:44,364][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:39:44,683][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:39:45,003][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:39:45,323][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:39:45,642][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:39:45,961][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:39:46,280][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:39:46,599][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:39:46,918][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:39:47,237][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:39:47,556][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:39:47,875][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:39:48,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:39:48,513][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:39:48,833][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:39:49,153][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:39:49,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:39:49,790][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:39:50,109][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:39:50,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:39:50,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:39:51,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:39:51,385][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:39:51,704][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:39:52,023][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:39:52,342][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:39:52,661][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:39:52,980][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:39:53,298][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:39:53,616][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:39:53,936][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:39:54,256][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:39:54,574][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:39:54,892][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:39:55,211][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:39:55,530][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:39:55,848][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:39:56,166][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:39:56,484][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:39:56,804][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:39:57,123][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:39:57,442][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:39:57,762][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:39:58,081][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:39:58,400][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:39:58,719][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:39:59,037][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:39:59,357][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:39:59,676][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:39:59,994][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:40:00,610][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:40:00,929][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:40:01,247][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:40:01,566][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:40:01,885][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:40:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:40:02,524][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:40:02,844][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:40:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:40:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:40:03,801][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:40:04,120][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:40:04,439][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:40:05,101][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:40:05,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:40:05,842][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:40:05,844][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:40:06,491][__main__][INFO] - Iteration 267 took 27s (12.05% Gen, 85.57% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 29m 25s. Estimated total time: 7h 34m 27s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 13s. [2026-03-25 17:40:06,493][__main__][INFO] - Starting iteration 267. [2026-03-25 17:40:06,496][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:40:06,496][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:40:09,801][__main__][INFO] - Number of regex retries in iteration 267: 0 [2026-03-25 17:40:09,802][__main__][INFO] - agents played in iteration 267 are Alice, Bob [2026-03-25 17:40:10,375][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:40:11,036][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:40:11,324][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:40:11,644][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:40:11,963][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:40:12,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:40:12,600][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:40:12,918][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:40:13,236][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:40:13,554][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:40:13,873][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:40:14,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:40:14,511][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:40:14,829][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:40:15,147][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:40:15,465][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:40:15,784][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:40:16,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:40:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:40:16,740][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:40:17,059][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:40:17,377][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:40:17,695][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:40:18,014][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:40:18,332][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:40:18,651][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:40:18,970][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:40:19,288][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:40:19,607][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:40:19,926][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:40:20,245][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:40:20,564][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:40:20,883][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:40:21,202][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:40:21,521][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:40:21,840][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:40:22,159][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:40:22,478][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:40:22,797][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:40:23,116][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:40:23,433][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:40:23,751][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:40:24,069][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:40:24,388][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:40:24,706][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:40:25,024][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:40:25,343][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:40:25,662][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:40:25,981][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:40:26,299][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:40:26,618][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:40:26,937][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:40:27,256][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:40:27,873][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:40:28,191][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:40:28,511][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:40:28,830][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:40:29,148][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:40:29,468][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:40:29,787][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:40:30,106][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:40:30,425][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:40:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:40:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:40:31,382][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:40:31,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:40:32,362][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:40:33,099][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:40:33,101][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:40:33,103][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:40:33,751][__main__][INFO] - Iteration 268 took 27s (12.13% Gen, 85.49% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 28m 47s. Estimated total time: 7h 34m 16s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 25s, 500 more iterations: 3h 47m 8s. [2026-03-25 17:40:33,753][__main__][INFO] - Starting iteration 268. [2026-03-25 17:40:33,756][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:40:33,757][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:40:37,063][__main__][INFO] - Number of regex retries in iteration 268: 0 [2026-03-25 17:40:37,063][__main__][INFO] - agents played in iteration 268 are Alice, Bob [2026-03-25 17:40:37,636][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:40:38,305][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:40:38,594][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:40:38,913][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:40:39,233][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:40:39,553][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:40:39,872][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:40:40,192][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:40:40,511][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:40:40,830][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:40:41,150][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:40:41,469][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:40:41,788][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:40:42,108][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:40:42,428][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:40:42,749][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:40:43,069][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:40:43,388][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:40:43,707][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:40:44,026][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:40:44,346][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:40:44,666][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:40:44,986][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:40:45,305][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:40:45,625][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:40:45,944][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:40:46,262][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:40:46,581][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:40:46,899][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:40:47,217][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:40:47,536][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:40:47,856][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:40:48,175][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:40:48,494][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:40:48,812][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:40:49,130][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:40:49,448][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:40:49,767][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:40:50,086][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:40:50,405][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:40:50,724][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:40:51,043][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:40:51,362][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:40:51,681][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:40:52,000][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:40:52,318][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:40:52,636][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:40:52,954][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:40:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:40:53,590][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:40:53,909][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:40:54,228][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:40:54,547][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:40:55,163][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:40:55,481][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:40:55,799][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:40:56,118][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:40:56,437][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:40:56,755][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:40:57,073][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:40:57,393][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:40:57,712][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:40:58,030][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:40:58,348][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:40:58,667][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:40:58,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:40:59,650][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:41:00,392][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:41:00,394][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:41:00,396][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:41:01,036][__main__][INFO] - Iteration 269 took 27s (12.12% Gen, 85.53% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 28m 44s. Estimated total time: 7h 34m 40s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 20s. [2026-03-25 17:41:01,039][__main__][INFO] - Starting iteration 269. [2026-03-25 17:41:01,042][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:41:01,042][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:41:01,422][mllm.models.large_language_model_local][WARNING] - Response did not match regex: (|), retry 1/1 [2026-03-25 17:41:04,328][__main__][INFO] - Number of regex retries in iteration 269: 1 [2026-03-25 17:41:04,328][__main__][INFO] - agents played in iteration 269 are Alice, Bob [2026-03-25 17:41:04,895][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:41:05,562][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:41:05,852][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:41:06,172][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:41:06,490][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:41:06,808][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:41:07,127][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:41:07,446][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:41:07,765][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:41:08,083][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:41:08,402][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:41:08,721][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:41:09,039][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:41:09,358][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:41:09,676][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:41:09,994][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:41:10,312][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:41:10,630][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:41:10,949][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:41:11,268][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:41:11,586][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:41:11,906][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:41:12,225][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:41:12,544][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:41:12,864][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:41:13,183][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:41:13,502][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:41:13,821][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:41:14,140][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:41:14,458][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:41:14,776][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:41:15,095][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:41:15,414][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:41:15,733][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:41:16,051][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:41:16,369][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:41:16,687][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:41:17,006][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:41:17,325][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:41:17,644][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:41:17,963][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:41:18,283][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:41:18,602][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:41:18,921][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:41:19,240][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:41:19,559][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:41:19,877][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:41:20,195][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:41:20,513][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:41:20,832][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:41:21,150][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:41:21,469][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:41:21,788][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:41:22,404][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:41:22,723][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:41:23,042][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:41:23,360][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:41:23,679][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:41:23,997][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:41:24,315][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:41:24,633][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:41:24,952][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:41:25,271][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:41:25,590][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:41:25,909][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:41:26,228][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:41:26,891][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:41:27,634][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:41:27,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:41:27,638][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:41:28,282][__main__][INFO] - Iteration 270 took 27s (12.06% Gen, 85.57% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 27m 37s. Estimated total time: 7h 34m 1s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 24s, 500 more iterations: 3h 47m 0s. [2026-03-25 17:41:28,284][__main__][INFO] - Starting iteration 270. [2026-03-25 17:41:28,287][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:41:28,287][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:41:31,439][__main__][INFO] - Number of regex retries in iteration 270: 0 [2026-03-25 17:41:31,440][__main__][INFO] - agents played in iteration 270 are Alice, Bob [2026-03-25 17:41:32,008][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:41:32,666][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:41:32,956][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:41:33,276][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:41:33,596][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:41:33,915][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:41:34,233][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:41:34,552][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:41:34,870][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:41:35,189][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:41:35,508][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:41:35,827][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:41:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:41:36,465][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:41:36,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:41:37,103][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:41:37,422][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:41:37,740][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:41:38,059][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:41:38,378][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:41:38,697][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:41:39,017][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:41:39,336][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:41:39,654][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:41:39,972][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:41:40,291][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:41:40,610][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:41:40,929][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:41:41,247][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:41:41,566][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:41:41,886][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:41:42,206][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:41:42,525][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:41:42,844][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:41:43,163][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:41:43,481][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:41:43,800][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:41:44,119][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:41:44,438][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:41:44,757][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:41:45,077][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:41:45,396][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:41:45,715][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:41:46,034][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:41:46,353][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:41:46,671][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:41:46,991][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:41:47,309][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:41:47,628][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:41:47,947][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:41:48,266][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:41:48,585][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:41:48,904][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:41:49,519][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:41:49,838][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:41:50,158][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:41:50,478][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:41:50,797][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:41:51,116][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:41:51,436][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:41:51,755][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:41:52,075][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:41:52,395][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:41:52,713][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:41:53,034][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:41:53,353][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:41:54,027][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:41:54,756][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:41:54,758][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:41:54,760][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:41:55,412][__main__][INFO] - Iteration 271 took 27s (11.62% Gen, 85.97% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 25m 14s. Estimated total time: 7h 32m 5s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 12s, 500 more iterations: 3h 46m 2s. [2026-03-25 17:41:55,414][__main__][INFO] - Starting iteration 271. [2026-03-25 17:41:55,417][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:41:55,417][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:41:58,619][__main__][INFO] - Number of regex retries in iteration 271: 0 [2026-03-25 17:41:58,620][__main__][INFO] - agents played in iteration 271 are Alice, Bob [2026-03-25 17:41:59,200][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:41:59,856][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:42:00,145][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:42:00,464][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:42:00,782][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:42:01,101][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:42:01,420][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:42:01,738][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:42:02,057][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:42:02,376][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:42:02,694][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:42:03,013][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:42:03,332][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:42:03,651][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:42:03,969][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:42:04,288][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:42:04,607][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:42:04,926][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:42:05,245][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:42:05,564][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:42:05,883][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:42:06,201][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:42:06,519][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:42:06,838][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:42:07,158][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:42:07,476][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:42:07,795][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:42:08,114][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:42:08,433][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:42:08,751][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:42:09,071][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:42:09,390][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:42:09,708][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:42:10,028][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:42:10,347][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:42:10,666][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:42:10,985][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:42:11,303][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:42:11,621][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:42:11,940][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:42:12,260][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:42:12,580][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:42:12,898][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:42:13,217][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:42:13,536][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:42:13,855][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:42:14,174][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:42:14,492][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:42:14,811][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:42:15,130][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:42:15,450][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:42:15,768][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:42:16,086][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:42:16,700][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:42:17,019][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:42:17,337][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:42:17,655][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:42:17,973][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:42:18,292][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:42:18,611][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:42:18,931][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:42:19,250][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:42:19,569][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:42:19,889][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:42:20,207][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:42:20,527][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:42:21,187][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:42:21,922][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:42:21,925][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:42:21,926][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:42:22,571][__main__][INFO] - Iteration 272 took 27s (11.79% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 25m 17s. Estimated total time: 7h 32m 35s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 17s. [2026-03-25 17:42:22,574][__main__][INFO] - Starting iteration 272. [2026-03-25 17:42:22,577][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:42:22,577][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:42:25,777][__main__][INFO] - Number of regex retries in iteration 272: 0 [2026-03-25 17:42:25,778][__main__][INFO] - agents played in iteration 272 are Alice, Bob [2026-03-25 17:42:26,356][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:42:27,014][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:42:27,303][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:42:27,622][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:42:27,940][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:42:28,259][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:42:28,577][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:42:28,896][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:42:29,214][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:42:29,532][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:42:29,851][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:42:30,170][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:42:30,488][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:42:30,806][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:42:31,125][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:42:31,444][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:42:31,763][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:42:32,082][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:42:32,400][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:42:32,718][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:42:33,039][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:42:33,357][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:42:33,675][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:42:33,993][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:42:34,313][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:42:34,631][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:42:34,950][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:42:35,269][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:42:35,588][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:42:35,907][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:42:36,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:42:36,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:42:36,865][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:42:37,183][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:42:37,502][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:42:37,820][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:42:38,138][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:42:38,456][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:42:38,774][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:42:39,093][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:42:39,411][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:42:39,730][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:42:40,048][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:42:40,367][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:42:40,686][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:42:41,005][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:42:41,324][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:42:41,643][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:42:41,962][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:42:42,281][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:42:42,600][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:42:42,918][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:42:43,237][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:42:43,852][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:42:44,171][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:42:44,489][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:42:44,808][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:42:45,127][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:42:45,446][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:42:45,764][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:42:46,083][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:42:46,402][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:42:46,721][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:42:47,040][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:42:47,359][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:42:47,677][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:42:48,337][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:42:49,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:42:49,070][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:42:49,072][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:42:49,716][__main__][INFO] - Iteration 273 took 27s (11.79% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 24m 34s. Estimated total time: 7h 32m 19s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 9s. [2026-03-25 17:42:49,719][__main__][INFO] - Starting iteration 273. [2026-03-25 17:42:49,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:42:49,723][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:42:52,911][__main__][INFO] - Number of regex retries in iteration 273: 0 [2026-03-25 17:42:52,912][__main__][INFO] - agents played in iteration 273 are Alice, Bob [2026-03-25 17:42:53,482][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:42:54,138][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:42:54,426][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:42:54,745][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:42:55,066][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:42:55,384][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:42:55,703][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:42:56,022][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:42:56,340][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:42:56,660][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:42:56,978][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:42:57,297][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:42:57,616][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:42:57,933][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:42:58,252][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:42:58,571][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:42:58,889][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:42:59,208][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:42:59,526][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:42:59,844][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:43:00,164][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:43:00,482][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:43:00,800][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:43:01,118][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:43:01,437][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:43:01,755][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:43:02,074][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:43:02,393][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:43:02,713][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:43:03,031][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:43:03,351][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:43:03,671][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:43:03,990][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:43:04,308][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:43:04,628][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:43:04,947][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:43:05,268][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:43:05,587][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:43:05,907][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:43:06,227][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:43:06,547][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:43:06,867][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:43:07,186][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:43:07,506][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:43:07,827][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:43:08,146][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:43:08,466][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:43:08,786][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:43:09,107][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:43:09,426][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:43:09,746][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:43:10,066][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:43:10,386][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:43:11,001][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:43:11,320][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:43:11,638][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:43:11,958][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:43:12,279][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:43:12,598][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:43:12,916][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:43:13,234][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:43:13,552][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:43:13,872][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:43:14,191][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:43:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:43:14,829][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:43:15,489][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:43:16,223][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:43:16,225][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:43:16,227][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:43:16,870][__main__][INFO] - Iteration 274 took 27s (11.75% Gen, 85.88% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 24m 16s. Estimated total time: 7h 32m 28s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 14s, 500 more iterations: 3h 46m 14s. [2026-03-25 17:43:16,872][__main__][INFO] - Starting iteration 274. [2026-03-25 17:43:16,875][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:43:16,876][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:43:20,033][__main__][INFO] - Number of regex retries in iteration 274: 0 [2026-03-25 17:43:20,033][__main__][INFO] - agents played in iteration 274 are Alice, Bob [2026-03-25 17:43:20,605][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:43:21,261][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:43:21,552][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:43:21,872][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:43:22,190][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:43:22,509][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:43:22,828][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:43:23,147][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:43:23,466][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:43:23,785][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:43:24,103][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:43:24,421][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:43:24,739][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:43:25,058][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:43:25,377][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:43:25,696][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:43:26,015][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:43:26,335][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:43:26,653][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:43:26,972][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:43:27,292][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:43:27,611][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:43:27,930][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:43:28,250][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:43:28,569][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:43:28,888][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:43:29,206][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:43:29,524][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:43:29,843][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:43:30,163][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:43:30,482][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:43:30,801][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:43:31,120][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:43:31,439][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:43:31,758][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:43:32,077][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:43:32,397][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:43:32,715][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:43:33,033][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:43:33,352][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:43:33,670][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:43:33,989][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:43:34,307][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:43:34,626][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:43:34,947][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:43:35,267][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:43:35,586][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:43:35,905][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:43:36,224][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:43:36,543][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:43:36,862][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:43:37,180][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:43:37,499][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:43:38,113][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:43:38,432][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:43:38,751][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:43:39,069][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:43:39,389][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:43:39,707][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:43:40,025][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:43:40,344][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:43:40,663][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:43:40,982][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:43:41,300][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:43:41,618][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:43:41,937][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:43:42,598][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:43:43,332][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:43:43,334][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:43:43,336][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:43:43,972][__main__][INFO] - Iteration 275 took 27s (11.65% Gen, 85.99% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 22m 58s. Estimated total time: 7h 31m 37s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 9s, 500 more iterations: 3h 45m 48s. [2026-03-25 17:43:43,974][__main__][INFO] - Starting iteration 275. [2026-03-25 17:43:43,977][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:43:43,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:43:47,239][__main__][INFO] - Number of regex retries in iteration 275: 0 [2026-03-25 17:43:47,240][__main__][INFO] - agents played in iteration 275 are Alice, Bob [2026-03-25 17:43:47,846][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:43:48,503][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:43:48,793][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:43:49,112][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:43:49,431][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:43:49,751][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:43:50,069][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:43:50,388][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:43:50,707][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:43:51,026][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:43:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:43:51,665][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:43:51,985][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:43:52,303][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:43:52,622][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:43:52,942][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:43:53,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:43:53,579][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:43:53,897][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:43:54,216][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:43:54,535][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:43:54,854][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:43:55,173][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:43:55,492][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:43:55,811][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:43:56,129][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:43:56,447][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:43:56,767][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:43:57,087][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:43:57,408][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:43:57,727][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:43:58,046][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:43:58,365][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:43:58,685][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:43:59,005][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:43:59,324][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:43:59,642][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:43:59,961][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:44:00,280][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:44:00,600][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:44:00,919][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:44:01,239][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:44:01,559][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:44:01,878][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:44:02,197][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:44:02,516][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:44:02,836][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:44:03,155][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:44:03,474][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:44:03,793][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:44:04,112][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:44:04,430][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:44:04,748][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:44:05,362][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:44:05,681][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:44:06,000][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:44:06,319][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:44:06,638][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:44:06,956][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:44:07,274][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:44:07,594][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:44:07,912][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:44:08,231][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:44:08,550][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:44:08,868][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:44:09,186][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:44:09,848][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:44:10,582][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:44:10,584][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:44:10,586][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:44:11,239][__main__][INFO] - Iteration 276 took 27s (11.97% Gen, 85.63% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 25m 16s. Estimated total time: 7h 34m 22s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 11s. [2026-03-25 17:44:11,241][__main__][INFO] - Starting iteration 276. [2026-03-25 17:44:11,244][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:44:11,244][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:44:14,518][__main__][INFO] - Number of regex retries in iteration 276: 0 [2026-03-25 17:44:14,519][__main__][INFO] - agents played in iteration 276 are Alice, Bob [2026-03-25 17:44:15,089][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:44:15,748][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:44:16,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:44:16,355][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:44:16,673][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:44:16,993][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:44:17,312][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:44:17,632][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:44:17,951][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:44:18,269][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:44:18,589][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:44:18,908][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:44:19,228][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:44:19,548][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:44:19,869][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:44:20,188][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:44:20,507][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:44:20,828][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:44:21,147][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:44:21,467][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:44:21,787][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:44:22,107][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:44:22,427][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:44:22,746][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:44:23,067][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:44:23,387][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:44:23,707][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:44:24,028][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:44:24,347][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:44:24,667][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:44:24,986][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:44:25,304][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:44:25,624][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:44:25,943][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:44:26,264][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:44:26,583][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:44:26,902][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:44:27,222][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:44:27,540][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:44:27,858][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:44:28,177][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:44:28,496][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:44:28,815][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:44:29,134][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:44:29,453][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:44:29,771][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:44:30,090][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:44:30,408][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:44:30,726][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:44:31,045][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:44:31,365][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:44:31,685][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:44:32,004][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:44:32,621][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:44:32,940][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:44:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:44:33,578][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:44:33,897][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:44:34,216][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:44:34,535][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:44:34,854][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:44:35,172][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:44:35,491][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:44:35,809][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:44:36,128][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:44:36,447][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:44:37,110][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:44:37,845][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:44:37,848][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:44:37,849][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:44:38,499][__main__][INFO] - Iteration 277 took 27s (12.01% Gen, 85.60% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 24m 42s. Estimated total time: 7h 34m 16s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 25s, 500 more iterations: 3h 47m 8s. [2026-03-25 17:44:38,502][__main__][INFO] - Starting iteration 277. [2026-03-25 17:44:38,504][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:44:38,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:44:41,707][__main__][INFO] - Number of regex retries in iteration 277: 0 [2026-03-25 17:44:41,707][__main__][INFO] - agents played in iteration 277 are Alice, Bob [2026-03-25 17:44:42,324][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:44:42,985][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:44:43,274][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:44:43,593][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:44:43,911][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:44:44,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:44:44,548][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:44:44,867][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:44:45,186][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:44:45,506][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:44:45,826][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:44:46,144][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:44:46,463][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:44:46,782][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:44:47,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:44:47,420][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:44:47,739][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:44:48,058][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:44:48,376][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:44:48,696][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:44:49,015][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:44:49,334][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:44:49,653][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:44:49,972][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:44:50,290][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:44:50,608][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:44:50,926][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:44:51,246][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:44:51,566][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:44:51,885][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:44:52,204][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:44:52,523][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:44:52,841][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:44:53,160][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:44:53,479][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:44:53,797][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:44:54,115][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:44:54,435][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:44:54,754][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:44:55,073][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:44:55,391][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:44:55,711][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:44:56,029][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:44:56,349][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:44:56,667][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:44:56,987][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:44:57,307][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:44:57,626][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:44:57,945][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:44:58,264][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:44:58,583][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:44:58,902][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:44:59,222][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:44:59,838][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:45:00,158][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:45:00,477][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:45:00,795][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:45:01,114][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:45:01,433][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:45:01,752][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:45:02,071][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:45:02,390][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:45:02,709][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:45:03,028][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:45:03,347][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:45:03,666][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:45:04,331][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:45:05,068][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:45:05,070][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:45:05,072][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:45:05,721][__main__][INFO] - Iteration 278 took 27s (11.77% Gen, 85.84% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 23m 36s. Estimated total time: 7h 33m 37s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 48s. [2026-03-25 17:45:05,723][__main__][INFO] - Starting iteration 278. [2026-03-25 17:45:05,726][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:45:05,727][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:45:08,962][__main__][INFO] - Number of regex retries in iteration 278: 0 [2026-03-25 17:45:08,962][__main__][INFO] - agents played in iteration 278 are Alice, Bob [2026-03-25 17:45:09,535][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:45:10,193][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:45:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:45:10,804][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:45:11,123][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:45:11,442][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:45:11,761][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:45:12,079][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:45:12,398][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:45:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:45:13,036][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:45:13,354][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:45:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:45:13,991][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:45:14,309][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:45:14,627][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:45:14,946][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:45:15,265][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:45:15,584][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:45:15,903][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:45:16,222][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:45:16,540][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:45:16,859][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:45:17,178][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:45:17,497][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:45:17,815][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:45:18,133][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:45:18,451][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:45:18,770][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:45:19,089][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:45:19,408][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:45:19,727][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:45:20,046][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:45:20,366][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:45:20,686][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:45:21,006][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:45:21,325][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:45:21,644][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:45:21,964][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:45:22,283][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:45:22,601][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:45:22,920][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:45:23,240][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:45:23,560][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:45:23,882][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:45:24,201][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:45:24,521][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:45:24,840][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:45:25,158][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:45:25,477][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:45:25,795][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:45:26,114][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:45:26,433][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:45:27,048][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:45:27,367][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:45:27,686][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:45:28,005][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:45:28,325][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:45:28,644][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:45:28,963][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:45:29,283][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:45:29,603][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:45:29,921][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:45:30,240][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:45:30,559][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:45:30,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:45:31,542][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:45:32,180][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:45:32,182][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:45:32,184][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:45:32,835][__main__][INFO] - Iteration 279 took 27s (11.93% Gen, 85.66% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 21m 21s. Estimated total time: 7h 31m 49s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 10s, 500 more iterations: 3h 45m 54s. [2026-03-25 17:45:32,837][__main__][INFO] - Starting iteration 279. [2026-03-25 17:45:32,840][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:45:32,841][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:45:36,172][__main__][INFO] - Number of regex retries in iteration 279: 0 [2026-03-25 17:45:36,173][__main__][INFO] - agents played in iteration 279 are Alice, Bob [2026-03-25 17:45:36,741][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:45:37,410][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:45:37,701][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:45:38,021][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:45:38,341][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:45:38,661][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:45:38,980][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:45:39,299][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:45:39,619][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:45:39,939][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:45:40,260][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:45:40,580][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:45:40,900][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:45:41,219][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:45:41,538][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:45:41,856][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:45:42,176][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:45:42,495][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:45:42,813][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:45:43,132][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:45:43,453][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:45:43,771][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:45:44,091][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:45:44,410][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:45:44,728][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:45:45,048][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:45:45,366][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:45:45,685][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:45:46,005][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:45:46,324][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:45:46,642][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:45:46,961][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:45:47,279][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:45:47,599][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:45:47,918][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:45:48,237][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:45:48,556][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:45:48,876][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:45:49,194][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:45:49,514][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:45:49,833][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:45:50,151][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:45:50,471][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:45:50,790][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:45:51,110][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:45:51,429][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:45:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:45:52,067][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:45:52,387][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:45:52,705][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:45:53,024][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:45:53,343][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:45:53,663][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:45:54,276][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:45:54,595][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:45:54,914][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:45:55,233][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:45:55,552][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:45:55,871][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:45:56,191][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:45:56,510][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:45:56,828][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:45:57,147][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:45:57,467][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:45:57,787][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:45:58,106][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:45:58,767][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:45:59,496][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:45:59,498][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:45:59,500][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:46:00,144][__main__][INFO] - Iteration 280 took 27s (12.20% Gen, 85.43% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 24m 8s. Estimated total time: 7h 35m 4s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 30s, 500 more iterations: 3h 47m 32s. [2026-03-25 17:46:00,146][__main__][INFO] - Starting iteration 280. [2026-03-25 17:46:00,149][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:46:00,149][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:46:03,382][__main__][INFO] - Number of regex retries in iteration 280: 0 [2026-03-25 17:46:03,383][__main__][INFO] - agents played in iteration 280 are Alice, Bob [2026-03-25 17:46:03,943][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:46:04,600][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:46:04,889][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:46:05,208][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:46:05,526][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:46:05,845][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:46:06,163][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:46:06,482][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:46:06,800][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:46:07,121][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:46:07,439][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:46:07,757][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:46:08,075][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:46:08,395][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:46:08,713][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:46:09,032][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:46:09,350][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:46:09,670][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:46:09,989][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:46:10,308][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:46:10,627][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:46:10,947][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:46:11,266][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:46:11,586][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:46:11,907][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:46:12,226][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:46:12,546][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:46:12,864][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:46:13,183][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:46:13,502][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:46:13,820][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:46:14,139][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:46:14,458][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:46:14,777][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:46:15,097][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:46:15,417][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:46:15,735][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:46:16,054][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:46:16,373][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:46:16,691][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:46:17,009][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:46:17,328][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:46:17,647][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:46:17,967][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:46:18,286][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:46:18,605][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:46:18,924][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:46:19,243][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:46:19,562][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:46:19,881][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:46:20,200][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:46:20,518][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:46:20,838][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:46:21,453][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:46:21,771][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:46:22,090][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:46:22,409][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:46:22,728][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:46:23,047][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:46:23,367][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:46:23,685][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:46:24,006][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:46:24,325][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:46:24,645][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:46:24,965][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:46:25,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:46:25,944][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:46:26,676][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:46:26,679][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:46:26,680][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:46:27,324][__main__][INFO] - Iteration 281 took 27s (11.90% Gen, 85.73% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 21m 33s. Estimated total time: 7h 32m 56s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 28s. [2026-03-25 17:46:27,326][__main__][INFO] - Starting iteration 281. [2026-03-25 17:46:27,329][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:46:27,330][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:46:30,611][__main__][INFO] - Number of regex retries in iteration 281: 0 [2026-03-25 17:46:30,612][__main__][INFO] - agents played in iteration 281 are Alice, Bob [2026-03-25 17:46:31,177][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:46:31,838][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:46:32,127][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:46:32,446][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:46:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:46:33,084][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:46:33,403][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:46:33,722][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:46:34,040][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:46:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:46:34,680][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:46:34,999][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:46:35,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:46:35,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:46:35,957][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:46:36,275][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:46:36,593][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:46:36,912][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:46:37,231][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:46:37,550][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:46:37,869][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:46:38,188][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:46:38,508][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:46:38,827][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:46:39,147][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:46:39,466][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:46:39,787][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:46:40,106][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:46:40,425][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:46:40,745][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:46:41,064][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:46:41,383][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:46:41,702][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:46:42,022][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:46:42,340][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:46:42,659][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:46:42,978][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:46:43,296][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:46:43,615][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:46:43,934][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:46:44,253][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:46:44,573][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:46:44,891][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:46:45,210][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:46:45,529][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:46:45,849][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:46:46,168][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:46:46,489][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:46:46,809][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:46:47,128][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:46:47,449][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:46:47,769][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:46:48,088][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:46:48,709][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:46:49,029][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:46:49,348][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:46:49,668][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:46:49,988][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:46:50,307][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:46:50,628][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:46:50,947][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:46:51,267][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:46:51,587][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:46:51,907][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:46:52,227][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:46:52,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:46:53,212][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:46:53,952][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:46:53,954][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:46:53,956][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:46:54,609][__main__][INFO] - Iteration 282 took 27s (12.03% Gen, 85.57% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 22m 50s. Estimated total time: 7h 34m 40s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 20s. [2026-03-25 17:46:54,611][__main__][INFO] - Starting iteration 282. [2026-03-25 17:46:54,614][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:46:54,615][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:46:57,825][__main__][INFO] - Number of regex retries in iteration 282: 0 [2026-03-25 17:46:57,826][__main__][INFO] - agents played in iteration 282 are Alice, Bob [2026-03-25 17:46:58,381][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:46:59,042][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:46:59,331][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:46:59,650][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:46:59,969][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:47:00,288][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:47:00,607][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:47:00,926][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:47:01,247][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:47:01,567][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:47:01,886][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:47:02,205][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:47:02,524][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:47:02,843][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:47:03,162][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:47:03,481][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:47:03,799][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:47:04,118][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:47:04,436][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:47:04,755][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:47:05,074][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:47:05,393][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:47:05,713][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:47:06,034][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:47:06,353][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:47:06,672][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:47:06,991][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:47:07,310][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:47:07,630][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:47:07,948][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:47:08,266][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:47:08,585][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:47:08,903][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:47:09,221][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:47:09,541][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:47:09,859][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:47:10,179][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:47:10,497][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:47:10,817][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:47:11,137][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:47:11,457][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:47:11,776][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:47:12,095][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:47:12,414][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:47:12,733][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:47:13,051][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:47:13,370][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:47:13,688][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:47:14,007][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:47:14,326][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:47:14,645][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:47:14,964][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:47:15,284][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:47:15,899][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:47:16,218][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:47:16,538][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:47:16,857][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:47:17,176][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:47:17,495][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:47:17,815][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:47:18,133][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:47:18,453][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:47:18,771][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:47:19,089][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:47:19,408][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:47:19,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:47:20,390][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:47:21,131][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:47:21,133][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:47:21,135][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:47:21,775][__main__][INFO] - Iteration 283 took 27s (11.82% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 20m 24s. Estimated total time: 7h 32m 41s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 16s, 500 more iterations: 3h 46m 20s. [2026-03-25 17:47:21,777][__main__][INFO] - Starting iteration 283. [2026-03-25 17:47:21,780][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:47:21,781][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:47:25,028][__main__][INFO] - Number of regex retries in iteration 283: 0 [2026-03-25 17:47:25,029][__main__][INFO] - agents played in iteration 283 are Alice, Bob [2026-03-25 17:47:25,596][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:47:26,257][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:47:26,546][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:47:26,866][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:47:27,186][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:47:27,505][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:47:27,824][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:47:28,143][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:47:28,461][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:47:28,779][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:47:29,097][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:47:29,417][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:47:29,735][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:47:30,055][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:47:30,373][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:47:30,694][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:47:31,012][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:47:31,330][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:47:31,648][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:47:31,967][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:47:32,287][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:47:32,606][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:47:32,925][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:47:33,244][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:47:33,563][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:47:33,881][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:47:34,201][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:47:34,519][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:47:34,838][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:47:35,157][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:47:35,476][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:47:35,795][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:47:36,114][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:47:36,434][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:47:36,753][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:47:37,072][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:47:37,391][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:47:37,710][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:47:38,028][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:47:38,347][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:47:38,666][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:47:38,984][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:47:39,303][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:47:39,622][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:47:39,941][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:47:40,261][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:47:40,580][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:47:40,899][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:47:41,218][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:47:41,537][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:47:41,855][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:47:42,175][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:47:42,494][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:47:43,110][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:47:43,429][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:47:43,747][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:47:44,068][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:47:44,387][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:47:44,707][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:47:45,026][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:47:45,346][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:47:45,665][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:47:45,984][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:47:46,303][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:47:46,621][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:47:46,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:47:47,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:47:48,345][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:47:48,347][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:47:48,348][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:47:48,997][__main__][INFO] - Iteration 284 took 27s (11.93% Gen, 85.68% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 20m 53s. Estimated total time: 7h 33m 37s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 48s. [2026-03-25 17:47:48,999][__main__][INFO] - Starting iteration 284. [2026-03-25 17:47:49,002][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:47:49,003][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:47:52,238][__main__][INFO] - Number of regex retries in iteration 284: 0 [2026-03-25 17:47:52,239][__main__][INFO] - agents played in iteration 284 are Alice, Bob [2026-03-25 17:47:52,794][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:47:53,455][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:47:53,745][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:47:54,064][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:47:54,383][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:47:54,701][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:47:55,020][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:47:55,339][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:47:55,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:47:55,978][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:47:56,297][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:47:56,616][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:47:56,936][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:47:57,255][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:47:57,575][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:47:57,894][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:47:58,212][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:47:58,531][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:47:58,850][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:47:59,170][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:47:59,490][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:47:59,811][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:48:00,131][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:48:00,450][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:48:00,770][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:48:01,089][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:48:01,409][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:48:01,729][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:48:02,049][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:48:02,369][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:48:02,689][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:48:03,010][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:48:03,329][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:48:03,649][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:48:03,968][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:48:04,289][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:48:04,609][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:48:04,929][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:48:05,249][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:48:05,570][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:48:05,889][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:48:06,209][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:48:06,529][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:48:06,849][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:48:07,168][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:48:07,488][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:48:07,809][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:48:08,128][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:48:08,448][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:48:08,768][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:48:09,087][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:48:09,409][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:48:09,729][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:48:10,345][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:48:10,664][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:48:10,983][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:48:11,302][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:48:11,621][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:48:11,939][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:48:12,258][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:48:12,577][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:48:12,896][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:48:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:48:13,535][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:48:13,854][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:48:14,173][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:48:14,834][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:48:15,574][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:48:15,576][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:48:15,578][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:48:16,222][__main__][INFO] - Iteration 285 took 27s (11.89% Gen, 85.74% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 20m 29s. Estimated total time: 7h 33m 41s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 22s, 500 more iterations: 3h 46m 50s. [2026-03-25 17:48:16,224][__main__][INFO] - Starting iteration 285. [2026-03-25 17:48:16,227][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:48:16,228][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:48:19,460][__main__][INFO] - Number of regex retries in iteration 285: 0 [2026-03-25 17:48:19,461][__main__][INFO] - agents played in iteration 285 are Alice, Bob [2026-03-25 17:48:20,016][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:48:20,672][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:48:20,963][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:48:21,282][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:48:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:48:21,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:48:22,238][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:48:22,557][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:48:22,876][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:48:23,195][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:48:23,514][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:48:23,834][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:48:24,153][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:48:24,471][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:48:24,791][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:48:25,110][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:48:25,428][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:48:25,746][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:48:26,065][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:48:26,384][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:48:26,703][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:48:27,022][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:48:27,341][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:48:27,661][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:48:27,979][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:48:28,298][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:48:28,618][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:48:28,937][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:48:29,256][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:48:29,575][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:48:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:48:30,212][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:48:30,531][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:48:30,850][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:48:31,169][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:48:31,488][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:48:31,807][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:48:32,127][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:48:32,446][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:48:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:48:33,085][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:48:33,403][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:48:33,721][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:48:34,040][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:48:34,359][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:48:34,678][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:48:34,997][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:48:35,317][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:48:35,637][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:48:35,956][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:48:36,276][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:48:36,594][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:48:36,914][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:48:37,530][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:48:37,850][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:48:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:48:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:48:38,807][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:48:39,127][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:48:39,446][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:48:39,765][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:48:40,084][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:48:40,403][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:48:40,722][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:48:41,040][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:48:41,359][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:48:42,015][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:48:42,755][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:48:42,758][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:48:42,759][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:48:43,404][__main__][INFO] - Iteration 286 took 27s (11.89% Gen, 85.73% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 19m 19s. Estimated total time: 7h 32m 57s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 28s. [2026-03-25 17:48:43,407][__main__][INFO] - Starting iteration 286. [2026-03-25 17:48:43,409][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:48:43,410][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:48:46,603][__main__][INFO] - Number of regex retries in iteration 286: 0 [2026-03-25 17:48:46,604][__main__][INFO] - agents played in iteration 286 are Alice, Bob [2026-03-25 17:48:47,159][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:48:47,811][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:48:48,100][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:48:48,419][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:48:48,738][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:48:49,057][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:48:49,375][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:48:49,694][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:48:50,012][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:48:50,331][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:48:50,650][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:48:50,967][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:48:51,285][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:48:51,605][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:48:51,923][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:48:52,243][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:48:52,562][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:48:52,880][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:48:53,200][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:48:53,519][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:48:53,838][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:48:54,157][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:48:54,478][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:48:54,797][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:48:55,116][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:48:55,436][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:48:55,754][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:48:56,073][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:48:56,391][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:48:56,710][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:48:57,029][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:48:57,349][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:48:57,668][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:48:57,987][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:48:58,306][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:48:58,625][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:48:58,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:48:59,264][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:48:59,583][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:48:59,901][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:49:00,220][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:49:00,539][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:49:00,858][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:49:01,178][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:49:01,497][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:49:01,815][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:49:02,135][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:49:02,455][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:49:02,772][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:49:03,091][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:49:03,409][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:49:03,728][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:49:04,047][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:49:04,660][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:49:04,978][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:49:05,297][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:49:05,616][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:49:05,935][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:49:06,254][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:49:06,573][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:49:06,891][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:49:07,210][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:49:07,529][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:49:07,848][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:49:08,166][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:49:08,485][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:49:09,141][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:49:09,887][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:49:09,889][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:49:09,890][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:49:10,539][__main__][INFO] - Iteration 287 took 27s (11.77% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 18m 4s. Estimated total time: 7h 32m 10s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 5s. [2026-03-25 17:49:10,541][__main__][INFO] - Starting iteration 287. [2026-03-25 17:49:10,544][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:49:10,544][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:49:13,770][__main__][INFO] - Number of regex retries in iteration 287: 0 [2026-03-25 17:49:13,771][__main__][INFO] - agents played in iteration 287 are Alice, Bob [2026-03-25 17:49:14,335][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:49:14,990][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:49:15,279][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:49:15,599][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:49:15,919][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:49:16,238][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:49:16,558][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:49:16,877][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:49:17,198][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:49:17,518][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:49:17,838][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:49:18,157][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:49:18,477][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:49:18,796][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:49:19,115][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:49:19,435][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:49:19,754][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:49:20,074][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:49:20,395][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:49:20,716][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:49:21,036][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:49:21,356][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:49:21,676][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:49:21,996][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:49:22,316][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:49:22,634][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:49:22,954][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:49:23,274][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:49:23,594][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:49:23,913][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:49:24,233][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:49:24,552][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:49:24,872][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:49:25,191][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:49:25,510][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:49:25,829][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:49:26,148][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:49:26,468][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:49:26,788][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:49:27,108][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:49:27,427][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:49:27,747][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:49:28,066][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:49:28,385][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:49:28,704][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:49:29,023][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:49:29,342][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:49:29,662][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:49:29,981][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:49:30,300][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:49:30,618][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:49:30,937][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:49:31,255][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:49:31,866][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:49:32,185][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:49:32,503][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:49:32,823][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:49:33,142][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:49:33,461][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:49:33,780][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:49:34,099][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:49:34,418][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:49:34,738][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:49:35,056][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:49:35,374][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:49:35,694][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:49:36,350][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:49:37,080][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:49:37,082][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:49:37,084][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:49:37,731][__main__][INFO] - Iteration 288 took 27s (11.87% Gen, 85.75% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 18m 34s. Estimated total time: 7h 33m 7s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 33s. [2026-03-25 17:49:37,733][__main__][INFO] - Starting iteration 288. [2026-03-25 17:49:37,736][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:49:37,737][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:49:40,987][__main__][INFO] - Number of regex retries in iteration 288: 0 [2026-03-25 17:49:40,987][__main__][INFO] - agents played in iteration 288 are Alice, Bob [2026-03-25 17:49:41,545][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:49:42,197][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:49:42,486][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:49:42,806][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:49:43,125][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:49:43,444][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:49:43,763][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:49:44,082][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:49:44,401][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:49:44,721][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:49:45,040][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:49:45,359][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:49:45,678][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:49:45,997][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:49:46,315][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:49:46,633][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:49:46,952][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:49:47,271][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:49:47,590][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:49:47,909][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:49:48,228][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:49:48,547][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:49:48,866][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:49:49,186][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:49:49,507][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:49:49,827][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:49:50,148][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:49:50,467][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:49:50,787][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:49:51,106][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:49:51,425][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:49:51,744][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:49:52,063][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:49:52,382][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:49:52,701][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:49:53,020][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:49:53,338][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:49:53,657][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:49:53,975][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:49:54,294][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:49:54,613][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:49:54,932][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:49:55,250][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:49:55,569][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:49:55,889][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:49:56,208][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:49:56,527][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:49:56,846][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:49:57,165][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:49:57,484][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:49:57,803][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:49:58,121][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:49:58,439][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:49:59,052][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:49:59,371][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:49:59,691][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:50:00,010][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:50:00,328][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:50:00,647][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:50:00,965][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:50:01,285][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:50:01,604][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:50:01,923][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:50:02,241][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:50:02,560][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:50:02,879][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:50:03,536][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:50:04,281][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:50:04,283][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:50:04,285][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:50:04,948][__main__][INFO] - Iteration 289 took 27s (11.95% Gen, 85.61% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 18m 32s. Estimated total time: 7h 33m 33s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 46s. [2026-03-25 17:50:04,950][__main__][INFO] - Starting iteration 289. [2026-03-25 17:50:04,953][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:50:04,954][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:50:08,197][__main__][INFO] - Number of regex retries in iteration 289: 0 [2026-03-25 17:50:08,198][__main__][INFO] - agents played in iteration 289 are Alice, Bob [2026-03-25 17:50:08,754][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:50:09,406][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:50:09,696][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:50:10,014][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:50:10,334][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:50:10,652][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:50:10,970][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:50:11,288][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:50:11,607][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:50:11,925][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:50:12,244][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:50:12,563][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:50:12,882][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:50:13,200][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:50:13,518][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:50:13,837][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:50:14,155][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:50:14,473][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:50:14,793][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:50:15,112][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:50:15,430][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:50:15,749][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:50:16,068][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:50:16,388][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:50:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:50:17,028][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:50:17,347][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:50:17,666][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:50:17,986][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:50:18,305][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:50:18,624][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:50:18,943][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:50:19,261][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:50:19,581][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:50:19,900][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:50:20,219][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:50:20,538][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:50:20,856][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:50:21,175][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:50:21,494][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:50:21,813][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:50:22,132][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:50:22,451][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:50:22,769][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:50:23,088][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:50:23,406][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:50:23,725][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:50:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:50:24,364][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:50:24,682][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:50:25,002][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:50:25,321][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:50:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:50:26,253][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:50:26,572][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:50:26,892][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:50:27,211][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:50:27,530][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:50:27,849][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:50:28,168][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:50:28,491][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:50:28,811][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:50:29,132][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:50:29,452][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:50:29,772][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:50:30,092][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:50:30,796][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:50:31,540][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:50:31,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:50:31,543][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:50:32,241][__main__][INFO] - Iteration 290 took 27s (11.89% Gen, 85.55% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 19m 21s. Estimated total time: 7h 34m 49s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 24s. [2026-03-25 17:50:32,245][__main__][INFO] - Starting iteration 290. [2026-03-25 17:50:32,248][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:50:32,249][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:50:35,601][__main__][INFO] - Number of regex retries in iteration 290: 0 [2026-03-25 17:50:35,602][__main__][INFO] - agents played in iteration 290 are Alice, Bob [2026-03-25 17:50:36,221][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:50:36,916][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:50:37,206][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:50:37,528][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:50:37,848][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:50:38,168][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:50:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:50:38,808][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:50:39,128][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:50:39,447][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:50:39,767][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:50:40,087][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:50:40,406][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:50:40,725][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:50:41,044][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:50:41,364][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:50:41,683][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:50:42,003][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:50:42,323][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:50:42,642][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:50:42,960][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:50:43,279][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:50:43,598][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:50:43,917][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:50:44,236][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:50:44,554][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:50:44,873][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:50:45,192][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:50:45,511][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:50:45,829][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:50:46,148][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:50:46,467][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:50:46,788][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:50:47,107][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:50:47,427][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:50:47,746][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:50:48,066][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:50:48,385][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:50:48,705][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:50:49,023][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:50:49,341][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:50:49,659][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:50:49,978][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:50:50,297][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:50:50,616][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:50:50,934][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:50:51,253][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:50:51,571][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:50:51,891][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:50:52,211][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:50:52,531][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:50:52,849][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:50:53,168][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:50:53,815][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:50:54,134][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:50:54,453][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:50:54,773][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:50:55,093][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:50:55,412][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:50:55,731][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:50:56,050][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:50:56,368][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:50:56,686][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:50:57,005][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:50:57,324][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:50:57,642][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:50:58,298][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:50:59,031][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:50:59,033][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:50:59,035][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:50:59,771][__main__][INFO] - Iteration 291 took 27s (12.18% Gen, 85.14% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 22m 49s. Estimated total time: 7h 38m 44s. Time estimates for 10 more iterations: 4m 35s, 100 more iterations: 45m 52s, 500 more iterations: 3h 49m 22s. [2026-03-25 17:50:59,774][__main__][INFO] - Starting iteration 291. [2026-03-25 17:50:59,777][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:50:59,777][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:51:03,152][__main__][INFO] - Number of regex retries in iteration 291: 0 [2026-03-25 17:51:03,153][__main__][INFO] - agents played in iteration 291 are Alice, Bob [2026-03-25 17:51:03,731][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:51:04,383][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:51:04,673][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:51:04,992][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:51:05,310][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:51:05,629][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:51:05,948][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:51:06,269][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:51:06,588][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:51:06,908][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:51:07,227][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:51:07,546][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:51:07,864][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:51:08,182][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:51:08,500][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:51:08,818][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:51:09,137][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:51:09,457][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:51:09,776][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:51:10,094][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:51:10,412][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:51:10,731][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:51:11,049][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:51:11,369][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:51:11,688][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:51:12,007][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:51:12,327][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:51:12,645][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:51:12,964][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:51:13,283][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:51:13,602][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:51:13,921][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:51:14,240][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:51:14,558][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:51:14,877][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:51:15,195][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:51:15,514][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:51:15,833][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:51:16,151][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:51:16,469][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:51:16,788][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:51:17,108][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:51:17,428][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:51:17,747][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:51:18,066][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:51:18,385][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:51:18,703][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:51:19,023][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:51:19,341][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:51:19,660][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:51:19,978][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:51:20,297][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:51:20,616][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:51:21,257][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:51:21,576][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:51:21,894][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:51:22,213][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:51:22,531][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:51:22,850][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:51:23,169][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:51:23,489][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:51:23,808][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:51:24,127][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:51:24,446][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:51:24,767][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:51:25,086][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:51:25,758][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:51:26,510][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:51:26,512][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:51:26,514][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:51:27,180][__main__][INFO] - Iteration 292 took 27s (12.32% Gen, 85.25% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 20m 21s. Estimated total time: 7h 36m 44s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 22s. [2026-03-25 17:51:27,182][__main__][INFO] - Starting iteration 292. [2026-03-25 17:51:27,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:51:27,185][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:51:30,441][__main__][INFO] - Number of regex retries in iteration 292: 0 [2026-03-25 17:51:30,442][__main__][INFO] - agents played in iteration 292 are Alice, Bob [2026-03-25 17:51:31,032][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:51:31,684][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:51:31,974][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:51:32,294][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:51:32,612][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:51:32,931][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:51:33,249][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:51:33,568][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:51:33,886][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:51:34,205][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:51:34,524][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:51:34,842][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:51:35,160][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:51:35,480][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:51:35,799][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:51:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:51:36,438][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:51:36,757][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:51:37,075][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:51:37,393][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:51:37,712][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:51:38,031][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:51:38,350][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:51:38,669][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:51:38,988][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:51:39,307][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:51:39,626][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:51:39,945][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:51:40,264][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:51:40,582][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:51:40,901][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:51:41,220][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:51:41,539][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:51:41,857][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:51:42,175][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:51:42,494][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:51:42,813][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:51:43,132][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:51:43,451][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:51:43,769][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:51:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:51:44,408][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:51:44,728][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:51:45,048][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:51:45,367][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:51:45,687][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:51:46,007][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:51:46,328][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:51:46,648][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:51:46,968][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:51:47,289][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:51:47,609][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:51:47,928][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:51:48,583][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:51:48,903][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:51:49,222][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:51:49,542][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:51:49,862][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:51:50,181][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:51:50,501][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:51:50,819][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:51:51,138][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:51:51,457][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:51:51,777][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:51:52,097][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:51:52,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:51:53,073][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:51:53,809][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:51:53,811][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:51:53,813][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:51:54,480][__main__][INFO] - Iteration 293 took 27s (11.93% Gen, 85.62% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 18m 6s. Estimated total time: 7h 34m 56s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 29s, 500 more iterations: 3h 47m 28s. [2026-03-25 17:51:54,483][__main__][INFO] - Starting iteration 293. [2026-03-25 17:51:54,486][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:51:54,486][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:51:57,787][__main__][INFO] - Number of regex retries in iteration 293: 0 [2026-03-25 17:51:57,788][__main__][INFO] - agents played in iteration 293 are Alice, Bob [2026-03-25 17:51:58,380][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:51:59,032][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:51:59,325][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:51:59,645][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:51:59,964][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:52:00,283][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:52:00,601][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:52:00,919][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:52:01,238][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:52:01,556][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:52:01,875][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:52:02,193][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:52:02,511][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:52:02,831][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:52:03,150][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:52:03,469][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:52:03,788][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:52:04,106][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:52:04,424][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:52:04,743][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:52:05,062][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:52:05,380][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:52:05,698][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:52:06,017][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:52:06,335][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:52:06,655][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:52:06,974][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:52:07,292][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:52:07,612][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:52:07,930][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:52:08,249][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:52:08,569][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:52:08,888][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:52:09,208][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:52:09,527][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:52:09,846][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:52:10,165][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:52:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:52:10,802][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:52:11,121][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:52:11,440][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:52:11,760][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:52:12,079][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:52:12,398][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:52:12,716][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:52:13,035][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:52:13,353][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:52:13,671][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:52:13,990][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:52:14,308][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:52:14,628][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:52:14,947][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:52:15,266][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:52:15,916][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:52:16,235][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:52:16,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:52:16,872][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:52:17,191][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:52:17,510][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:52:17,828][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:52:18,147][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:52:18,466][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:52:18,785][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:52:19,104][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:52:19,423][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:52:19,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:52:20,403][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:52:21,133][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:52:21,135][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:52:21,137][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:52:21,803][__main__][INFO] - Iteration 294 took 27s (12.09% Gen, 85.47% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 18m 0s. Estimated total time: 7h 35m 18s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 39s. [2026-03-25 17:52:21,805][__main__][INFO] - Starting iteration 294. [2026-03-25 17:52:21,808][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:52:21,809][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:52:25,109][__main__][INFO] - Number of regex retries in iteration 294: 0 [2026-03-25 17:52:25,109][__main__][INFO] - agents played in iteration 294 are Alice, Bob [2026-03-25 17:52:25,680][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:52:26,331][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:52:26,620][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:52:26,940][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:52:27,258][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:52:27,576][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:52:27,894][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:52:28,213][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:52:28,532][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:52:28,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:52:29,168][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:52:29,487][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:52:29,806][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:52:30,125][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:52:30,444][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:52:30,763][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:52:31,081][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:52:31,401][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:52:31,720][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:52:32,039][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:52:32,357][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:52:32,676][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:52:32,995][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:52:33,314][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:52:33,632][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:52:33,952][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:52:34,271][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:52:34,589][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:52:34,908][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:52:35,227][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:52:35,547][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:52:35,866][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:52:36,185][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:52:36,504][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:52:36,824][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:52:37,142][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:52:37,461][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:52:37,780][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:52:38,099][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:52:38,418][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:52:38,737][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:52:39,057][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:52:39,376][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:52:39,695][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:52:40,014][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:52:40,333][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:52:40,651][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:52:40,970][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:52:41,290][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:52:41,609][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:52:41,929][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:52:42,248][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:52:42,567][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:52:43,180][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:52:43,498][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:52:43,817][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:52:44,136][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:52:44,455][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:52:44,773][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:52:45,093][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:52:45,411][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:52:45,731][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:52:46,050][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:52:46,369][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:52:46,688][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:52:47,007][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:52:47,662][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:52:48,388][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:52:48,390][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:52:48,391][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:52:49,051][__main__][INFO] - Iteration 295 took 27s (12.12% Gen, 85.46% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 16m 19s. Estimated total time: 7h 34m 4s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 24s, 500 more iterations: 3h 47m 2s. [2026-03-25 17:52:49,053][__main__][INFO] - Starting iteration 295. [2026-03-25 17:52:49,057][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:52:49,057][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:52:52,276][__main__][INFO] - Number of regex retries in iteration 295: 0 [2026-03-25 17:52:52,276][__main__][INFO] - agents played in iteration 295 are Alice, Bob [2026-03-25 17:52:52,847][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:52:53,499][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:52:53,789][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:52:54,109][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:52:54,429][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:52:54,747][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:52:55,066][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:52:55,386][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:52:55,706][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:52:56,025][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:52:56,345][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:52:56,665][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:52:56,985][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:52:57,305][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:52:57,624][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:52:57,943][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:52:58,261][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:52:58,581][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:52:58,901][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:52:59,221][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:52:59,541][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:52:59,861][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:53:00,181][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:53:00,500][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:53:00,820][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:53:01,139][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:53:01,461][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:53:01,781][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:53:02,102][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:53:02,422][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:53:02,741][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:53:03,060][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:53:03,380][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:53:03,700][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:53:04,019][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:53:04,339][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:53:04,658][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:53:04,977][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:53:05,297][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:53:05,617][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:53:05,937][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:53:06,257][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:53:06,575][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:53:06,894][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:53:07,212][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:53:07,531][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:53:07,849][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:53:08,168][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:53:08,488][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:53:08,807][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:53:09,126][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:53:09,446][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:53:09,765][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:53:10,379][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:53:10,698][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:53:11,018][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:53:11,336][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:53:11,655][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:53:11,975][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:53:12,294][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:53:12,613][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:53:12,932][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:53:13,253][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:53:13,572][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:53:13,890][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:53:14,209][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:53:14,866][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:53:15,601][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:53:15,605][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:53:15,607][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:53:16,266][__main__][INFO] - Iteration 296 took 27s (11.83% Gen, 85.74% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 15m 18s. Estimated total time: 7h 33m 30s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 45s. [2026-03-25 17:53:16,271][__main__][INFO] - Starting iteration 296. [2026-03-25 17:53:16,275][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:53:16,275][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:53:19,503][__main__][INFO] - Number of regex retries in iteration 296: 0 [2026-03-25 17:53:19,504][__main__][INFO] - agents played in iteration 296 are Alice, Bob [2026-03-25 17:53:20,069][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:53:20,724][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:53:21,012][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:53:21,331][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:53:21,650][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:53:21,969][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:53:22,288][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:53:22,606][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:53:22,925][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:53:23,244][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:53:23,563][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:53:23,882][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:53:24,202][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:53:24,522][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:53:24,841][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:53:25,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:53:25,478][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:53:25,796][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:53:26,115][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:53:26,435][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:53:26,754][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:53:27,073][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:53:27,393][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:53:27,712][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:53:28,030][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:53:28,349][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:53:28,668][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:53:28,987][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:53:29,306][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:53:29,626][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:53:29,945][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:53:30,265][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:53:30,584][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:53:30,903][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:53:31,223][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:53:31,542][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:53:31,862][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:53:32,181][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:53:32,500][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:53:32,819][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:53:33,138][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:53:33,456][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:53:33,775][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:53:34,095][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:53:34,414][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:53:34,733][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:53:35,053][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:53:35,372][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:53:35,692][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:53:36,011][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:53:36,331][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:53:36,651][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:53:36,970][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:53:37,585][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:53:37,904][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:53:38,223][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:53:38,541][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:53:38,860][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:53:39,179][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:53:39,499][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:53:39,818][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:53:40,137][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:53:40,456][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:53:40,775][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:53:41,094][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:53:41,414][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:53:42,072][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:53:42,832][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:53:42,834][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:53:42,836][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:53:43,467][__main__][INFO] - Iteration 297 took 27s (11.87% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 14m 34s. Estimated total time: 7h 33m 13s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 36s. [2026-03-25 17:53:43,470][__main__][INFO] - Starting iteration 297. [2026-03-25 17:53:43,475][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:53:43,475][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:53:46,804][__main__][INFO] - Number of regex retries in iteration 297: 0 [2026-03-25 17:53:46,805][__main__][INFO] - agents played in iteration 297 are Alice, Bob [2026-03-25 17:53:47,381][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:53:48,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:53:48,322][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:53:48,641][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:53:48,959][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:53:49,278][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:53:49,597][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:53:49,916][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:53:50,234][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:53:50,553][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:53:50,872][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:53:51,192][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:53:51,510][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:53:51,828][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:53:52,146][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:53:52,465][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:53:52,784][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:53:53,103][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:53:53,422][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:53:53,741][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:53:54,060][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:53:54,379][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:53:54,699][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:53:55,018][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:53:55,336][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:53:55,654][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:53:55,972][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:53:56,291][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:53:56,610][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:53:56,929][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:53:57,248][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:53:57,566][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:53:57,885][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:53:58,204][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:53:58,523][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:53:58,842][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:53:59,160][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:53:59,480][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:53:59,799][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:54:00,118][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:54:00,438][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:54:00,757][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:54:01,076][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:54:01,395][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:54:01,713][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:54:02,032][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:54:02,352][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:54:02,671][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:54:02,989][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:54:03,308][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:54:03,627][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:54:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:54:04,264][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:54:04,876][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:54:05,195][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:54:05,513][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:54:05,832][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:54:06,150][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:54:06,469][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:54:06,788][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:54:07,107][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:54:07,427][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:54:07,745][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:54:08,065][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:54:08,384][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:54:08,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:54:09,359][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:54:10,100][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:54:10,102][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:54:10,104][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:54:10,747][__main__][INFO] - Iteration 298 took 27s (12.21% Gen, 85.43% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 15m 27s. Estimated total time: 7h 34m 33s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 27s, 500 more iterations: 3h 47m 16s. [2026-03-25 17:54:10,749][__main__][INFO] - Starting iteration 298. [2026-03-25 17:54:10,752][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:54:10,753][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:54:14,035][__main__][INFO] - Number of regex retries in iteration 298: 0 [2026-03-25 17:54:14,036][__main__][INFO] - agents played in iteration 298 are Alice, Bob [2026-03-25 17:54:14,622][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:54:15,283][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:54:15,573][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:54:15,893][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:54:16,213][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:54:16,533][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:54:16,854][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:54:17,173][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:54:17,492][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:54:17,812][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:54:18,132][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:54:18,451][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:54:18,771][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:54:19,089][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:54:19,409][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:54:19,729][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:54:20,050][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:54:20,369][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:54:20,688][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:54:21,007][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:54:21,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:54:21,646][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:54:21,965][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:54:22,284][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:54:22,603][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:54:22,921][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:54:23,241][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:54:23,561][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:54:23,879][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:54:24,198][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:54:24,516][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:54:24,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:54:25,153][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:54:25,472][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:54:25,791][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:54:26,110][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:54:26,429][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:54:26,748][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:54:27,066][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:54:27,386][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:54:27,705][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:54:28,024][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:54:28,344][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:54:28,663][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:54:28,983][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:54:29,301][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:54:29,620][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:54:29,940][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:54:30,259][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:54:30,578][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:54:30,897][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:54:31,216][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:54:31,535][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:54:32,149][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:54:32,468][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:54:32,787][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:54:33,107][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:54:33,427][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:54:33,746][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:54:34,064][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:54:34,383][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:54:34,703][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:54:35,022][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:54:35,342][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:54:35,662][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:54:35,982][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:54:36,641][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:54:37,592][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:54:37,596][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:54:37,598][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:54:38,219][__main__][INFO] - Iteration 299 took 27s (11.95% Gen, 85.78% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 18m 13s. Estimated total time: 7h 37m 47s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 46s, 500 more iterations: 3h 48m 53s. [2026-03-25 17:54:38,221][__main__][INFO] - Starting iteration 299. [2026-03-25 17:54:38,224][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:54:38,225][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:54:41,584][__main__][INFO] - Number of regex retries in iteration 299: 0 [2026-03-25 17:54:41,585][__main__][INFO] - agents played in iteration 299 are Alice, Bob [2026-03-25 17:54:42,147][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:54:42,799][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:54:43,088][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:54:43,408][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:54:43,728][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:54:44,048][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:54:44,368][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:54:44,687][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:54:45,006][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:54:45,325][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:54:45,643][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:54:45,962][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:54:46,281][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:54:46,600][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:54:46,919][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:54:47,238][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:54:47,556][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:54:47,876][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:54:48,195][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:54:48,514][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:54:48,833][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:54:49,152][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:54:49,470][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:54:49,789][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:54:50,109][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:54:50,428][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:54:50,747][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:54:51,066][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:54:51,385][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:54:51,703][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:54:52,022][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:54:52,341][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:54:52,660][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:54:52,979][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:54:53,298][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:54:53,617][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:54:53,935][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:54:54,254][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:54:54,572][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:54:54,891][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:54:55,210][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:54:55,529][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:54:55,848][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:54:56,166][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:54:56,485][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:54:56,804][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:54:57,123][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:54:57,441][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:54:57,760][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:54:58,079][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:54:58,398][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:54:58,717][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:54:59,035][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:54:59,646][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:54:59,965][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:55:00,284][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:55:00,602][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:55:00,921][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:55:01,239][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:55:01,559][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:55:01,878][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:55:02,196][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:55:02,516][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:55:02,836][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:55:03,155][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:55:03,473][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:55:04,128][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:55:04,866][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:55:04,868][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:55:04,870][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:55:05,510][__main__][INFO] - Iteration 300 took 27s (12.31% Gen, 85.33% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 14m 45s. Estimated total time: 7h 34m 46s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 23s. [2026-03-25 17:55:05,512][__main__][INFO] - Starting iteration 300. [2026-03-25 17:55:05,515][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 5 and human policies 1. [2026-03-25 17:55:05,515][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:55:08,696][__main__][INFO] - Number of regex retries in iteration 300: 0 [2026-03-25 17:55:08,697][__main__][INFO] - agents played in iteration 300 are Alice, Bob [2026-03-25 17:55:09,260][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:55:09,910][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:55:10,200][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:55:10,520][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:55:10,839][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:55:11,157][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:55:11,476][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:55:11,796][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:55:12,114][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:55:12,433][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:55:12,751][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:55:13,070][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:55:13,388][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:55:13,707][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:55:14,027][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:55:14,345][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:55:14,664][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:55:14,983][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:55:15,302][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:55:15,621][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:55:15,941][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:55:16,259][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:55:16,578][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:55:16,897][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:55:17,216][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:55:17,534][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:55:17,852][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:55:18,171][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:55:18,490][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:55:18,809][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:55:19,128][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:55:19,447][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:55:19,767][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:55:20,086][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:55:20,404][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:55:20,722][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:55:21,041][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:55:21,359][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:55:21,678][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:55:21,996][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:55:22,314][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:55:22,632][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:55:22,951][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:55:23,269][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:55:23,588][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:55:23,906][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:55:24,225][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:55:24,545][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:55:24,864][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:55:25,183][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:55:25,502][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:55:25,821][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:55:26,141][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:55:26,753][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:55:27,073][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:55:27,392][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:55:27,713][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:55:28,032][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:55:28,352][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:55:28,672][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:55:28,992][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:55:29,310][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:55:29,630][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:55:29,949][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:55:30,269][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:55:30,589][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:55:31,261][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:55:32,013][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:55:32,016][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:55:32,017][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:55:33,307][__main__][INFO] - Iteration 301 took 27s (11.45% Gen, 83.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 22m 44s. Estimated total time: 7h 43m 13s. Time estimates for 10 more iterations: 4m 37s, 100 more iterations: 46m 19s, 500 more iterations: 3h 51m 36s. [2026-03-25 17:55:33,309][__main__][INFO] - Starting iteration 301. [2026-03-25 17:55:33,314][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:55:33,315][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:55:36,526][__main__][INFO] - Number of regex retries in iteration 301: 0 [2026-03-25 17:55:36,527][__main__][INFO] - agents played in iteration 301 are Alice, Bob [2026-03-25 17:55:37,090][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:55:37,741][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:55:38,030][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:55:38,351][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:55:38,670][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:55:38,990][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:55:39,308][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:55:39,627][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:55:39,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:55:40,266][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:55:40,585][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:55:40,903][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:55:41,222][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:55:41,541][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:55:41,860][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:55:42,179][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:55:42,497][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:55:42,816][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:55:43,134][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:55:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:55:43,771][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:55:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:55:44,407][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:55:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:55:45,045][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:55:45,364][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:55:45,683][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:55:46,002][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:55:46,321][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:55:46,640][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:55:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:55:47,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:55:47,594][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:55:47,913][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:55:48,231][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:55:48,550][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:55:48,869][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:55:49,188][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:55:49,507][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:55:49,826][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:55:50,146][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:55:50,465][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:55:50,783][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:55:51,101][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:55:51,421][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:55:51,739][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:55:52,059][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:55:52,378][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:55:52,698][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:55:53,017][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:55:53,335][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:55:53,653][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:55:53,971][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:55:54,583][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:55:54,902][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:55:55,221][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:55:55,539][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:55:55,858][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:55:56,176][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:55:56,494][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:55:56,813][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:55:57,132][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:55:57,452][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:55:57,771][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:55:58,089][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:55:58,408][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:55:59,063][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:55:59,786][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:55:59,788][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:55:59,790][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:56:00,493][__main__][INFO] - Iteration 302 took 27s (11.82% Gen, 85.58% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 12m 5s. Estimated total time: 7h 33m 1s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 30s. [2026-03-25 17:56:00,495][__main__][INFO] - Starting iteration 302. [2026-03-25 17:56:00,498][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:56:00,499][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:56:03,646][__main__][INFO] - Number of regex retries in iteration 302: 0 [2026-03-25 17:56:03,647][__main__][INFO] - agents played in iteration 302 are Alice, Bob [2026-03-25 17:56:04,209][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:56:04,864][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:56:05,152][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:56:05,473][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:56:05,794][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:56:06,113][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:56:06,432][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:56:06,751][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:56:07,070][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:56:07,389][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:56:07,708][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:56:08,028][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:56:08,347][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:56:08,666][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:56:08,984][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:56:09,303][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:56:09,622][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:56:09,941][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:56:10,260][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:56:10,579][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:56:10,898][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:56:11,216][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:56:11,535][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:56:11,854][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:56:12,173][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:56:12,492][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:56:12,811][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:56:13,130][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:56:13,449][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:56:13,768][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:56:14,087][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:56:14,406][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:56:14,725][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:56:15,044][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:56:15,363][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:56:15,682][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:56:16,001][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:56:16,319][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:56:16,638][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:56:16,957][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:56:17,276][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:56:17,594][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:56:17,912][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:56:18,231][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:56:18,550][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:56:18,869][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:56:19,188][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:56:19,507][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:56:19,826][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:56:20,145][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:56:20,464][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:56:20,784][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:56:21,103][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:56:21,718][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:56:22,037][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:56:22,356][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:56:22,676][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:56:22,995][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:56:23,314][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:56:23,632][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:56:23,952][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:56:24,270][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:56:24,589][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:56:24,908][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:56:25,227][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:56:25,547][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:56:26,206][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:56:26,959][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:56:26,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:56:26,962][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:56:27,628][__main__][INFO] - Iteration 303 took 27s (11.60% Gen, 85.94% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 10m 47s. Estimated total time: 7h 32m 10s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 5s. [2026-03-25 17:56:27,630][__main__][INFO] - Starting iteration 303. [2026-03-25 17:56:27,633][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:56:27,634][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:56:30,783][__main__][INFO] - Number of regex retries in iteration 303: 0 [2026-03-25 17:56:30,784][__main__][INFO] - agents played in iteration 303 are Alice, Bob [2026-03-25 17:56:31,345][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:56:32,000][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:56:32,289][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:56:32,609][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:56:32,928][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:56:33,247][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:56:33,566][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:56:33,884][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:56:34,203][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:56:34,522][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:56:34,841][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:56:35,160][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:56:35,479][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:56:35,798][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:56:36,116][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:56:36,434][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:56:36,753][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:56:37,073][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:56:37,392][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:56:37,712][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:56:38,031][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:56:38,350][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:56:38,668][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:56:38,988][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:56:39,308][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:56:39,627][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:56:39,947][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:56:40,267][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:56:40,587][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:56:40,906][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:56:41,226][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:56:41,545][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:56:41,864][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:56:42,184][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:56:42,503][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:56:42,822][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:56:43,141][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:56:43,461][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:56:43,781][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:56:44,101][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:56:44,420][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:56:44,739][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:56:45,058][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:56:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:56:45,697][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:56:46,016][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:56:46,335][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:56:46,654][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:56:46,973][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:56:47,292][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:56:47,613][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:56:47,932][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:56:48,251][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:56:48,872][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:56:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:56:49,511][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:56:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:56:50,148][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:56:50,467][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:56:50,787][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:56:51,105][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:56:51,424][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:56:51,742][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:56:52,061][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:56:52,378][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:56:52,698][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:56:53,358][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:56:54,089][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:56:54,092][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:56:54,093][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:56:54,730][__main__][INFO] - Iteration 304 took 27s (11.62% Gen, 86.02% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 9m 47s. Estimated total time: 7h 31m 37s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 9s, 500 more iterations: 3h 45m 48s. [2026-03-25 17:56:54,732][__main__][INFO] - Starting iteration 304. [2026-03-25 17:56:54,735][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:56:54,735][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:56:57,919][__main__][INFO] - Number of regex retries in iteration 304: 0 [2026-03-25 17:56:57,920][__main__][INFO] - agents played in iteration 304 are Alice, Bob [2026-03-25 17:56:58,471][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:56:59,132][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:56:59,420][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:56:59,739][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:57:00,057][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:57:00,376][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:57:00,694][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:57:01,012][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:57:01,331][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:57:01,649][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:57:01,968][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:57:02,287][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:57:02,607][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:57:02,926][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:57:03,244][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:57:03,562][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:57:03,881][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:57:04,199][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:57:04,517][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:57:04,836][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:57:05,154][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:57:05,473][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:57:05,792][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:57:06,112][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:57:06,431][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:57:06,750][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:57:07,069][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:57:07,388][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:57:07,707][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:57:08,026][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:57:08,346][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:57:08,665][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:57:08,983][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:57:09,301][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:57:09,621][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:57:09,939][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:57:10,258][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:57:10,577][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:57:10,895][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:57:11,215][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:57:11,533][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:57:11,852][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:57:12,170][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:57:12,490][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:57:12,809][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:57:13,128][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:57:13,446][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:57:13,765][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:57:14,085][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:57:14,403][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:57:14,723][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:57:15,041][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:57:15,359][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:57:15,978][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:57:16,296][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:57:16,615][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:57:16,932][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:57:17,252][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:57:17,570][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:57:17,888][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:57:18,207][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:57:18,526][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:57:18,844][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:57:19,163][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:57:19,481][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:57:19,800][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:57:20,472][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:57:21,219][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:57:21,221][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:57:21,223][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:57:21,921][__main__][INFO] - Iteration 305 took 27s (11.71% Gen, 85.71% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 10m 49s. Estimated total time: 7h 33m 6s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 33s. [2026-03-25 17:57:21,923][__main__][INFO] - Starting iteration 305. [2026-03-25 17:57:21,926][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:57:21,926][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:57:25,134][__main__][INFO] - Number of regex retries in iteration 305: 0 [2026-03-25 17:57:25,135][__main__][INFO] - agents played in iteration 305 are Alice, Bob [2026-03-25 17:57:25,704][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:57:26,366][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:57:26,655][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:57:26,976][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:57:27,295][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:57:27,614][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:57:27,933][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:57:28,253][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:57:28,572][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:57:28,892][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:57:29,210][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:57:29,529][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:57:29,848][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:57:30,168][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:57:30,487][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:57:30,807][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:57:31,126][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:57:31,446][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:57:31,765][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:57:32,084][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:57:32,404][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:57:32,723][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:57:33,042][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:57:33,361][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:57:33,680][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:57:33,999][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:57:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:57:34,636][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:57:34,955][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:57:35,275][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:57:35,594][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:57:35,914][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:57:36,234][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:57:36,553][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:57:36,872][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:57:37,191][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:57:37,510][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:57:37,830][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:57:38,149][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:57:38,468][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:57:38,787][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:57:39,106][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:57:39,425][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:57:39,744][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:57:40,063][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:57:40,382][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:57:40,701][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:57:41,019][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:57:41,338][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:57:41,657][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:57:41,975][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:57:42,295][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:57:42,613][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:57:43,231][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:57:43,549][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:57:43,868][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:57:44,188][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:57:44,507][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:57:44,825][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:57:45,145][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:57:45,463][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:57:45,783][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:57:46,102][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:57:46,421][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:57:46,741][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:57:47,060][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:57:47,725][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:57:48,464][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:57:48,467][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:57:48,468][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:57:49,112][__main__][INFO] - Iteration 306 took 27s (11.80% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 10m 22s. Estimated total time: 7h 33m 7s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 33s. [2026-03-25 17:57:49,114][__main__][INFO] - Starting iteration 306. [2026-03-25 17:57:49,117][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:57:49,118][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:57:52,317][__main__][INFO] - Number of regex retries in iteration 306: 0 [2026-03-25 17:57:52,318][__main__][INFO] - agents played in iteration 306 are Alice, Bob [2026-03-25 17:57:52,889][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:57:53,552][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:57:53,841][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:57:54,161][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:57:54,480][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:57:54,800][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:57:55,119][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:57:55,438][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:57:55,758][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:57:56,077][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:57:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:57:56,716][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:57:57,035][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:57:57,354][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:57:57,673][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:57:57,993][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:57:58,313][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:57:58,633][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:57:58,953][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:57:59,274][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:57:59,593][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:57:59,913][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:58:00,233][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:58:00,553][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:58:00,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:58:01,192][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:58:01,512][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:58:01,832][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:58:02,152][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:58:02,474][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:58:02,794][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:58:03,113][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:58:03,433][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:58:03,753][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:58:04,071][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:58:04,390][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:58:04,708][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:58:05,027][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:58:05,346][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:58:05,665][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:58:05,984][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:58:06,303][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:58:06,622][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:58:06,941][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:58:07,260][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:58:07,579][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:58:07,898][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:58:08,217][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:58:08,536][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:58:08,855][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:58:09,175][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:58:09,494][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:58:09,813][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:58:10,429][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:58:10,749][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:58:11,068][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:58:11,387][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:58:11,707][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:58:12,026][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:58:12,346][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:58:12,664][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:58:12,983][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:58:13,303][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:58:13,621][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:58:13,939][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:58:14,258][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:58:14,922][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:58:15,668][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:58:15,670][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:58:15,672][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:58:16,315][__main__][INFO] - Iteration 307 took 27s (11.77% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 10m 6s. Estimated total time: 7h 33m 18s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 39s. [2026-03-25 17:58:16,317][__main__][INFO] - Starting iteration 307. [2026-03-25 17:58:16,320][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:58:16,321][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:58:19,527][__main__][INFO] - Number of regex retries in iteration 307: 0 [2026-03-25 17:58:19,528][__main__][INFO] - agents played in iteration 307 are Alice, Bob [2026-03-25 17:58:20,086][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:58:20,747][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:58:21,036][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:58:21,354][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:58:21,673][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:58:21,991][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:58:22,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:58:22,630][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:58:22,949][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:58:23,268][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:58:23,587][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:58:23,906][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:58:24,224][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:58:24,543][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:58:24,861][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:58:25,181][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:58:25,501][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:58:25,820][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:58:26,139][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:58:26,458][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:58:26,776][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:58:27,096][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:58:27,415][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:58:27,735][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:58:28,054][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:58:28,373][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:58:28,691][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:58:29,011][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:58:29,330][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:58:29,649][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:58:29,968][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:58:30,286][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:58:30,606][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:58:30,925][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:58:31,244][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:58:31,563][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:58:31,882][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:58:32,201][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:58:32,522][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:58:32,841][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:58:33,159][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:58:33,479][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:58:33,799][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:58:34,118][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:58:34,437][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:58:34,756][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:58:35,076][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:58:35,395][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:58:35,714][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:58:36,034][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:58:36,353][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:58:36,672][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:58:36,991][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:58:37,609][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:58:37,929][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:58:38,249][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:58:38,568][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:58:38,889][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:58:39,207][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:58:39,526][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:58:39,846][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:58:40,165][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:58:40,484][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:58:40,802][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:58:41,121][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:58:41,441][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:58:42,109][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:58:42,844][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:58:42,846][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:58:42,848][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:58:43,500][__main__][INFO] - Iteration 308 took 27s (11.80% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 9m 21s. Estimated total time: 7h 33m 0s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 30s. [2026-03-25 17:58:43,502][__main__][INFO] - Starting iteration 308. [2026-03-25 17:58:43,505][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:58:43,506][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:58:46,745][__main__][INFO] - Number of regex retries in iteration 308: 0 [2026-03-25 17:58:46,746][__main__][INFO] - agents played in iteration 308 are Alice, Bob [2026-03-25 17:58:47,303][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:58:47,983][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:58:48,273][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:58:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:58:48,913][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:58:49,232][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:58:49,551][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:58:49,871][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:58:50,190][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:58:50,509][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:58:50,829][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:58:51,149][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:58:51,469][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:58:51,788][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:58:52,107][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:58:52,427][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:58:52,746][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:58:53,066][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:58:53,385][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:58:53,705][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:58:54,024][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:58:54,342][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:58:54,661][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:58:54,980][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:58:55,298][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:58:55,618][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:58:55,937][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:58:56,257][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:58:56,576][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:58:56,895][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:58:57,215][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:58:57,534][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:58:57,853][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:58:58,173][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:58:58,492][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:58:58,812][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:58:59,132][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:58:59,452][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:58:59,770][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:59:00,089][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:59:00,409][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:59:00,728][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:59:01,047][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:59:01,366][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:59:01,686][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:59:02,003][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:59:02,322][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:59:02,641][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:59:02,960][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:59:03,278][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:59:03,598][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:59:03,916][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:59:04,235][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:59:04,884][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:59:05,202][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:59:05,521][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:59:05,842][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:59:06,160][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:59:06,479][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:59:06,797][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:59:07,116][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:59:07,435][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:59:07,754][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:59:08,073][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:59:08,392][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:59:08,712][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:59:09,380][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:59:10,129][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:59:10,131][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:59:10,132][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:59:10,843][__main__][INFO] - Iteration 309 took 27s (11.85% Gen, 85.54% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 11m 32s. Estimated total time: 7h 35m 38s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 33s, 500 more iterations: 3h 47m 49s. [2026-03-25 17:59:10,845][__main__][INFO] - Starting iteration 309. [2026-03-25 17:59:10,848][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:59:10,849][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:59:14,053][__main__][INFO] - Number of regex retries in iteration 309: 0 [2026-03-25 17:59:14,054][__main__][INFO] - agents played in iteration 309 are Alice, Bob [2026-03-25 17:59:14,609][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:59:15,272][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:59:15,562][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:59:15,882][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:59:16,202][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:59:16,523][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:59:16,842][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:59:17,161][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:59:17,479][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:59:17,798][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:59:18,117][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:59:18,435][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:59:18,753][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:59:19,072][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:59:19,391][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:59:19,710][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:59:20,030][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:59:20,349][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:59:20,668][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:59:20,987][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:59:21,306][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:59:21,625][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:59:21,944][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:59:22,262][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:59:22,582][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:59:22,901][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:59:23,220][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:59:23,538][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:59:23,857][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:59:24,175][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:59:24,493][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:59:24,812][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:59:25,130][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:59:25,450][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:59:25,770][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:59:26,089][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:59:26,408][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:59:26,726][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:59:27,045][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:59:27,364][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:59:27,682][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:59:28,000][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:59:28,319][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:59:28,637][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:59:28,955][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:59:29,273][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:59:29,591][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:59:29,910][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:59:30,229][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:59:30,549][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:59:30,868][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:59:31,186][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:59:31,505][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:59:32,133][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:59:32,453][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 17:59:32,771][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 17:59:33,090][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 17:59:33,409][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 17:59:33,728][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 17:59:34,047][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 17:59:34,365][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 17:59:34,683][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 17:59:35,001][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 17:59:35,320][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 17:59:35,639][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 17:59:35,958][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 17:59:36,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 17:59:37,365][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 17:59:37,367][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 17:59:37,368][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 17:59:38,076][__main__][INFO] - Iteration 310 took 27s (11.77% Gen, 85.62% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 9m 15s. Estimated total time: 7h 33m 49s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 22s, 500 more iterations: 3h 46m 54s. [2026-03-25 17:59:38,079][__main__][INFO] - Starting iteration 310. [2026-03-25 17:59:38,082][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 17:59:38,082][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 17:59:41,308][__main__][INFO] - Number of regex retries in iteration 310: 0 [2026-03-25 17:59:41,309][__main__][INFO] - agents played in iteration 310 are Alice, Bob [2026-03-25 17:59:41,863][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 17:59:42,524][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 17:59:42,813][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 17:59:43,133][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 17:59:43,452][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 17:59:43,770][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 17:59:44,089][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 17:59:44,407][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 17:59:44,726][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 17:59:45,044][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 17:59:45,363][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 17:59:45,683][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 17:59:46,002][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 17:59:46,321][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 17:59:46,639][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 17:59:46,958][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 17:59:47,277][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 17:59:47,596][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 17:59:47,915][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 17:59:48,235][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 17:59:48,553][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 17:59:48,872][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 17:59:49,191][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 17:59:49,511][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 17:59:49,830][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 17:59:50,150][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 17:59:50,470][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 17:59:50,789][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 17:59:51,108][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 17:59:51,428][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 17:59:51,748][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 17:59:52,067][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 17:59:52,387][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 17:59:52,706][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 17:59:53,025][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 17:59:53,344][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 17:59:53,663][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 17:59:53,981][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 17:59:54,301][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 17:59:54,620][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 17:59:54,939][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 17:59:55,259][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 17:59:55,578][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 17:59:55,896][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 17:59:56,215][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 17:59:56,533][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 17:59:56,851][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 17:59:57,171][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 17:59:57,490][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 17:59:57,810][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 17:59:58,129][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 17:59:58,447][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 17:59:58,767][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 17:59:59,396][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 17:59:59,715][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:00:00,034][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:00:00,353][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:00:00,672][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:00:00,990][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:00:01,308][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:00:01,627][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:00:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:00:02,266][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:00:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:00:02,902][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:00:03,221][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:00:03,885][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:00:04,632][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:00:04,634][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:00:04,636][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:00:05,303][__main__][INFO] - Iteration 311 took 27s (11.85% Gen, 85.69% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 8m 41s. Estimated total time: 7h 33m 42s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 22s, 500 more iterations: 3h 46m 51s. [2026-03-25 18:00:05,305][__main__][INFO] - Starting iteration 311. [2026-03-25 18:00:05,308][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:00:05,309][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:00:08,512][__main__][INFO] - Number of regex retries in iteration 311: 0 [2026-03-25 18:00:08,513][__main__][INFO] - agents played in iteration 311 are Alice, Bob [2026-03-25 18:00:09,066][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:00:09,727][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:00:10,015][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:00:10,335][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:00:10,653][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:00:10,972][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:00:11,291][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:00:11,609][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:00:11,928][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:00:12,247][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:00:12,566][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:00:12,885][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:00:13,204][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:00:13,523][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:00:13,842][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:00:14,160][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:00:14,478][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:00:14,797][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:00:15,116][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:00:15,434][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:00:15,752][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:00:16,071][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:00:16,390][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:00:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:00:17,026][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:00:17,346][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:00:17,665][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:00:17,984][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:00:18,302][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:00:18,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:00:18,939][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:00:19,257][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:00:19,576][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:00:19,895][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:00:20,213][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:00:20,533][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:00:20,851][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:00:21,170][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:00:21,489][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:00:21,809][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:00:22,128][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:00:22,448][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:00:22,770][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:00:23,089][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:00:23,408][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:00:23,727][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:00:24,046][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:00:24,365][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:00:24,684][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:00:25,002][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:00:25,321][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:00:25,641][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:00:25,960][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:00:26,596][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:00:26,916][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:00:27,236][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:00:27,555][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:00:27,875][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:00:28,195][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:00:28,514][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:00:28,835][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:00:29,154][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:00:29,474][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:00:29,793][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:00:30,113][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:00:30,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:00:31,102][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:00:31,850][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:00:31,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:00:31,854][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:00:32,574][__main__][INFO] - Iteration 312 took 27s (11.75% Gen, 85.60% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 8m 59s. Estimated total time: 7h 34m 27s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 13s. [2026-03-25 18:00:32,577][__main__][INFO] - Starting iteration 312. [2026-03-25 18:00:32,579][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:00:32,580][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:00:35,814][__main__][INFO] - Number of regex retries in iteration 312: 0 [2026-03-25 18:00:35,815][__main__][INFO] - agents played in iteration 312 are Alice, Bob [2026-03-25 18:00:36,364][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:00:37,024][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:00:37,312][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:00:37,632][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:00:37,950][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:00:38,269][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:00:38,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:00:38,910][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:00:39,229][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:00:39,548][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:00:39,867][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:00:40,186][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:00:40,505][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:00:40,824][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:00:41,143][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:00:41,463][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:00:41,781][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:00:42,099][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:00:42,418][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:00:42,736][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:00:43,055][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:00:43,374][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:00:43,693][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:00:44,011][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:00:44,330][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:00:44,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:00:44,968][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:00:45,287][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:00:45,607][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:00:45,925][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:00:46,244][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:00:46,563][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:00:46,881][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:00:47,199][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:00:47,519][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:00:47,837][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:00:48,154][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:00:48,473][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:00:48,792][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:00:49,111][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:00:49,430][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:00:49,749][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:00:50,068][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:00:50,385][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:00:50,704][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:00:51,022][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:00:51,340][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:00:51,659][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:00:51,978][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:00:52,296][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:00:52,615][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:00:52,933][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:00:53,251][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:00:53,879][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:00:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:00:54,516][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:00:54,834][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:00:55,152][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:00:55,470][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:00:55,788][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:00:56,107][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:00:56,425][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:00:56,743][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:00:57,062][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:00:57,381][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:00:57,700][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:00:58,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:00:59,102][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:00:59,105][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:00:59,106][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:00:59,763][__main__][INFO] - Iteration 313 took 27s (11.90% Gen, 85.68% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 7m 9s. Estimated total time: 7h 33m 4s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 32s. [2026-03-25 18:00:59,765][__main__][INFO] - Starting iteration 313. [2026-03-25 18:00:59,768][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:00:59,769][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:01:02,990][__main__][INFO] - Number of regex retries in iteration 313: 0 [2026-03-25 18:01:02,990][__main__][INFO] - agents played in iteration 313 are Alice, Bob [2026-03-25 18:01:03,533][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:01:04,190][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:01:04,480][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:01:04,799][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:01:05,118][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:01:05,437][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:01:05,756][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:01:06,074][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:01:06,392][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:01:06,710][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:01:07,029][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:01:07,348][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:01:07,667][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:01:07,987][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:01:08,307][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:01:08,625][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:01:08,944][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:01:09,262][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:01:09,581][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:01:09,899][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:01:10,217][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:01:10,536][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:01:10,854][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:01:11,173][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:01:11,491][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:01:11,810][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:01:12,129][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:01:12,448][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:01:12,766][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:01:13,084][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:01:13,402][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:01:13,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:01:14,040][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:01:14,359][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:01:14,678][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:01:14,997][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:01:15,315][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:01:15,634][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:01:15,953][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:01:16,272][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:01:16,591][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:01:16,909][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:01:17,229][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:01:17,548][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:01:17,867][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:01:18,186][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:01:18,504][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:01:18,822][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:01:19,141][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:01:19,460][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:01:19,778][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:01:20,096][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:01:20,414][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:01:21,040][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:01:21,359][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:01:21,679][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:01:21,997][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:01:22,316][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:01:22,635][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:01:22,954][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:01:23,272][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:01:23,590][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:01:23,909][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:01:24,228][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:01:24,547][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:01:24,866][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:01:25,529][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:01:26,308][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:01:26,310][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:01:26,311][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:01:27,176][__main__][INFO] - Iteration 314 took 27s (11.76% Gen, 85.08% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 10m 26s. Estimated total time: 7h 36m 48s. Time estimates for 10 more iterations: 4m 34s, 100 more iterations: 45m 40s, 500 more iterations: 3h 48m 24s. [2026-03-25 18:01:27,178][__main__][INFO] - Starting iteration 314. [2026-03-25 18:01:27,181][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:01:27,182][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:01:30,392][__main__][INFO] - Number of regex retries in iteration 314: 0 [2026-03-25 18:01:30,393][__main__][INFO] - agents played in iteration 314 are Alice, Bob [2026-03-25 18:01:30,935][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:01:31,595][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:01:31,884][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:01:32,204][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:01:32,524][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:01:32,842][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:01:33,161][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:01:33,480][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:01:33,798][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:01:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:01:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:01:34,753][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:01:35,073][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:01:35,392][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:01:35,710][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:01:36,029][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:01:36,348][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:01:36,667][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:01:36,986][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:01:37,305][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:01:37,625][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:01:37,945][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:01:38,264][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:01:38,583][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:01:38,902][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:01:39,221][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:01:39,540][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:01:39,859][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:01:40,178][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:01:40,498][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:01:40,818][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:01:41,139][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:01:41,460][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:01:41,779][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:01:42,098][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:01:42,418][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:01:42,738][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:01:43,057][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:01:43,376][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:01:43,695][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:01:44,015][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:01:44,335][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:01:44,654][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:01:44,973][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:01:45,293][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:01:45,612][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:01:45,931][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:01:46,251][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:01:46,572][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:01:46,893][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:01:47,212][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:01:47,530][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:01:47,849][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:01:48,476][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:01:48,794][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:01:49,113][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:01:49,433][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:01:49,751][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:01:50,070][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:01:50,389][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:01:50,709][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:01:51,027][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:01:51,345][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:01:51,664][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:01:51,983][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:01:52,302][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:01:52,965][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:01:53,713][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:01:53,715][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:01:53,717][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:01:54,359][__main__][INFO] - Iteration 315 took 27s (11.81% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 6m 9s. Estimated total time: 7h 32m 59s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 29s. [2026-03-25 18:01:54,361][__main__][INFO] - Starting iteration 315. [2026-03-25 18:01:54,364][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:01:54,365][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:01:57,568][__main__][INFO] - Number of regex retries in iteration 315: 0 [2026-03-25 18:01:57,569][__main__][INFO] - agents played in iteration 315 are Alice, Bob [2026-03-25 18:01:58,115][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:01:58,773][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:01:59,061][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:01:59,381][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:01:59,700][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:02:00,019][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:02:00,337][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:02:00,655][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:02:00,974][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:02:01,293][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:02:01,611][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:02:01,931][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:02:02,250][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:02:02,569][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:02:02,888][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:02:03,208][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:02:03,527][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:02:03,846][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:02:04,165][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:02:04,485][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:02:04,804][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:02:05,123][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:02:05,442][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:02:05,761][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:02:06,080][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:02:06,398][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:02:06,717][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:02:07,036][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:02:07,355][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:02:07,673][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:02:07,993][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:02:08,312][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:02:08,631][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:02:08,950][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:02:09,269][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:02:09,588][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:02:09,906][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:02:10,224][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:02:10,545][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:02:10,864][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:02:11,183][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:02:11,502][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:02:11,821][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:02:12,140][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:02:12,458][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:02:12,777][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:02:13,097][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:02:13,415][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:02:13,734][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:02:14,053][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:02:14,372][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:02:14,691][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:02:15,010][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:02:15,627][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:02:15,945][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:02:16,264][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:02:16,582][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:02:16,900][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:02:17,219][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:02:17,537][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:02:17,857][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:02:18,176][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:02:18,495][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:02:18,815][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:02:19,134][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:02:19,453][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:02:20,116][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:02:20,851][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:02:20,853][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:02:20,855][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:02:21,499][__main__][INFO] - Iteration 316 took 27s (11.81% Gen, 85.81% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 4m 58s. Estimated total time: 7h 32m 15s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 7s. [2026-03-25 18:02:21,501][__main__][INFO] - Starting iteration 316. [2026-03-25 18:02:21,504][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:02:21,505][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:02:24,852][__main__][INFO] - Number of regex retries in iteration 316: 0 [2026-03-25 18:02:24,853][__main__][INFO] - agents played in iteration 316 are Alice, Bob [2026-03-25 18:02:25,407][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:02:26,066][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:02:26,356][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:02:26,675][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:02:26,995][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:02:27,315][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:02:27,634][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:02:27,953][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:02:28,272][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:02:28,592][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:02:28,912][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:02:29,231][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:02:29,550][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:02:29,869][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:02:30,188][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:02:30,506][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:02:30,824][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:02:31,144][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:02:31,463][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:02:31,782][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:02:32,101][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:02:32,420][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:02:32,738][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:02:33,057][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:02:33,376][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:02:33,694][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:02:34,013][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:02:34,332][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:02:34,651][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:02:34,969][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:02:35,288][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:02:35,607][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:02:35,926][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:02:36,245][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:02:36,564][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:02:36,883][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:02:37,203][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:02:37,521][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:02:37,840][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:02:38,159][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:02:38,477][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:02:38,795][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:02:39,114][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:02:39,432][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:02:39,750][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:02:40,069][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:02:40,388][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:02:40,707][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:02:41,025][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:02:41,344][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:02:41,662][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:02:41,981][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:02:42,300][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:02:42,915][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:02:43,234][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:02:43,554][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:02:43,872][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:02:44,191][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:02:44,509][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:02:44,829][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:02:45,147][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:02:45,466][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:02:45,785][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:02:46,104][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:02:46,423][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:02:46,741][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:02:47,404][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:02:48,152][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:02:48,154][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:02:48,156][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:02:48,800][__main__][INFO] - Iteration 317 took 27s (12.27% Gen, 85.37% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 7m 12s. Estimated total time: 7h 34m 56s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 29s, 500 more iterations: 3h 47m 28s. [2026-03-25 18:02:48,802][__main__][INFO] - Starting iteration 317. [2026-03-25 18:02:48,805][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:02:48,806][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:02:52,014][__main__][INFO] - Number of regex retries in iteration 317: 0 [2026-03-25 18:02:52,015][__main__][INFO] - agents played in iteration 317 are Alice, Bob [2026-03-25 18:02:52,566][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:02:53,225][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:02:53,514][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:02:53,835][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:02:54,156][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:02:54,476][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:02:54,796][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:02:55,116][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:02:55,436][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:02:55,756][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:02:56,075][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:02:56,396][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:02:56,715][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:02:57,035][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:02:57,354][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:02:57,673][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:02:57,992][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:02:58,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:02:58,632][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:02:58,952][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:02:59,271][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:02:59,591][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:02:59,911][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:03:00,230][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:03:00,551][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:03:00,871][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:03:01,191][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:03:01,511][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:03:01,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:03:02,149][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:03:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:03:02,786][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:03:03,105][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:03:03,424][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:03:03,741][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:03:04,061][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:03:04,380][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:03:04,698][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:03:05,016][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:03:05,336][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:03:05,655][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:03:05,974][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:03:06,293][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:03:06,612][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:03:06,931][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:03:07,251][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:03:07,570][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:03:07,889][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:03:08,209][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:03:08,528][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:03:08,847][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:03:09,165][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:03:09,484][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:03:10,100][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:03:10,419][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:03:10,738][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:03:11,057][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:03:11,376][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:03:11,694][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:03:12,013][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:03:12,332][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:03:12,651][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:03:12,970][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:03:13,290][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:03:13,609][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:03:13,928][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:03:14,591][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:03:15,327][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:03:15,329][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:03:15,330][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:03:15,971][__main__][INFO] - Iteration 318 took 27s (11.82% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 4m 35s. Estimated total time: 7h 32m 47s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 16s, 500 more iterations: 3h 46m 23s. [2026-03-25 18:03:15,973][__main__][INFO] - Starting iteration 318. [2026-03-25 18:03:15,976][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:03:15,977][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:03:19,167][__main__][INFO] - Number of regex retries in iteration 318: 0 [2026-03-25 18:03:19,167][__main__][INFO] - agents played in iteration 318 are Alice, Bob [2026-03-25 18:03:19,702][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:03:20,360][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:03:20,649][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:03:20,969][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:03:21,288][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:03:21,606][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:03:21,924][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:03:22,243][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:03:22,561][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:03:22,881][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:03:23,200][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:03:23,519][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:03:23,836][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:03:24,155][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:03:24,474][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:03:24,792][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:03:25,110][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:03:25,429][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:03:25,748][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:03:26,067][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:03:26,385][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:03:26,704][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:03:27,022][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:03:27,341][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:03:27,660][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:03:27,979][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:03:28,297][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:03:28,616][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:03:28,934][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:03:29,253][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:03:29,571][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:03:29,891][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:03:30,210][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:03:30,529][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:03:30,848][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:03:31,167][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:03:31,486][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:03:31,805][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:03:32,124][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:03:32,443][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:03:32,761][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:03:33,080][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:03:33,398][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:03:33,716][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:03:34,036][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:03:34,354][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:03:34,673][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:03:34,993][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:03:35,313][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:03:35,632][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:03:35,951][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:03:36,270][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:03:36,590][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:03:37,205][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:03:37,524][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:03:37,843][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:03:38,163][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:03:38,482][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:03:38,804][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:03:39,123][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:03:39,441][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:03:39,760][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:03:40,078][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:03:40,396][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:03:40,714][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:03:41,033][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:03:41,701][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:03:42,476][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:03:42,478][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:03:42,480][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:03:43,123][__main__][INFO] - Iteration 319 took 27s (11.75% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 3m 49s. Estimated total time: 7h 32m 27s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 14s, 500 more iterations: 3h 46m 13s. [2026-03-25 18:03:43,125][__main__][INFO] - Starting iteration 319. [2026-03-25 18:03:43,128][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:03:43,129][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:03:46,459][__main__][INFO] - Number of regex retries in iteration 319: 0 [2026-03-25 18:03:46,460][__main__][INFO] - agents played in iteration 319 are Alice, Bob [2026-03-25 18:03:46,986][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:03:47,644][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:03:47,934][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:03:48,253][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:03:48,571][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:03:48,889][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:03:49,208][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:03:49,527][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:03:49,846][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:03:50,164][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:03:50,483][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:03:50,802][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:03:51,120][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:03:51,439][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:03:51,757][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:03:52,075][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:03:52,393][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:03:52,712][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:03:53,031][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:03:53,351][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:03:53,670][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:03:53,989][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:03:54,308][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:03:54,627][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:03:54,946][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:03:55,264][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:03:55,583][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:03:55,902][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:03:56,221][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:03:56,540][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:03:56,859][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:03:57,177][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:03:57,495][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:03:57,814][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:03:58,132][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:03:58,450][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:03:58,770][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:03:59,088][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:03:59,407][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:03:59,726][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:04:00,045][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:04:00,364][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:04:00,683][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:04:01,002][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:04:01,321][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:04:01,640][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:04:01,959][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:04:02,278][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:04:02,597][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:04:02,915][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:04:03,234][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:04:03,553][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:04:03,871][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:04:04,485][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:04:04,804][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:04:05,124][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:04:05,444][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:04:05,763][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:04:06,084][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:04:06,403][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:04:06,722][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:04:07,042][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:04:07,362][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:04:07,681][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:04:08,000][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:04:08,321][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:04:08,989][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:04:09,731][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:04:09,733][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:04:09,735][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:04:10,373][__main__][INFO] - Iteration 320 took 27s (12.23% Gen, 85.42% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 5m 0s. Estimated total time: 7h 34m 6s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 24s, 500 more iterations: 3h 47m 3s. [2026-03-25 18:04:10,375][__main__][INFO] - Starting iteration 320. [2026-03-25 18:04:10,378][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:04:10,379][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:04:13,575][__main__][INFO] - Number of regex retries in iteration 320: 0 [2026-03-25 18:04:13,576][__main__][INFO] - agents played in iteration 320 are Alice, Bob [2026-03-25 18:04:14,103][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:04:14,759][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:04:15,049][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:04:15,371][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:04:15,691][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:04:16,010][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:04:16,329][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:04:16,648][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:04:16,967][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:04:17,285][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:04:17,603][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:04:17,921][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:04:18,240][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:04:18,559][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:04:18,878][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:04:19,197][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:04:19,516][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:04:19,835][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:04:20,153][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:04:20,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:04:20,791][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:04:21,109][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:04:21,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:04:21,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:04:22,066][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:04:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:04:22,703][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:04:23,022][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:04:23,341][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:04:23,660][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:04:23,979][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:04:24,297][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:04:24,615][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:04:24,934][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:04:25,252][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:04:25,571][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:04:25,890][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:04:26,209][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:04:26,529][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:04:26,848][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:04:27,167][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:04:27,486][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:04:27,805][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:04:28,123][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:04:28,442][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:04:28,761][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:04:29,080][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:04:29,398][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:04:29,717][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:04:30,036][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:04:30,355][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:04:30,673][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:04:30,992][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:04:31,606][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:04:31,925][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:04:32,244][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:04:32,562][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:04:32,881][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:04:33,201][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:04:33,519][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:04:33,839][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:04:34,158][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:04:34,476][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:04:34,795][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:04:35,115][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:04:35,434][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:04:36,093][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:04:36,835][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:04:36,837][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:04:36,839][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:04:37,476][__main__][INFO] - Iteration 321 took 27s (11.80% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 2m 5s. Estimated total time: 7h 31m 38s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 9s, 500 more iterations: 3h 45m 49s. [2026-03-25 18:04:37,478][__main__][INFO] - Starting iteration 321. [2026-03-25 18:04:37,481][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:04:37,482][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:04:40,694][__main__][INFO] - Number of regex retries in iteration 321: 0 [2026-03-25 18:04:40,695][__main__][INFO] - agents played in iteration 321 are Alice, Bob [2026-03-25 18:04:41,218][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:04:41,874][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:04:42,163][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:04:42,482][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:04:42,800][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:04:43,119][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:04:43,438][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:04:43,756][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:04:44,075][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:04:44,394][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:04:44,713][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:04:45,032][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:04:45,351][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:04:45,670][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:04:45,989][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:04:46,308][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:04:46,627][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:04:46,945][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:04:47,264][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:04:47,582][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:04:47,901][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:04:48,220][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:04:48,538][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:04:48,857][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:04:49,176][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:04:49,494][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:04:49,813][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:04:50,133][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:04:50,451][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:04:50,770][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:04:51,088][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:04:51,407][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:04:51,727][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:04:52,046][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:04:52,364][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:04:52,683][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:04:53,003][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:04:53,321][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:04:53,639][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:04:53,958][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:04:54,276][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:04:54,595][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:04:54,913][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:04:55,232][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:04:55,551][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:04:55,870][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:04:56,189][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:04:56,508][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:04:56,827][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:04:57,147][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:04:57,466][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:04:57,785][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:04:58,103][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:04:58,719][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:04:59,037][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:04:59,355][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:04:59,674][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:04:59,993][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:05:00,312][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:05:00,631][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:05:00,951][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:05:01,270][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:05:01,589][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:05:01,908][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:05:02,226][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:05:02,546][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:05:03,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:05:03,944][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:05:03,946][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:05:03,947][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:05:04,593][__main__][INFO] - Iteration 322 took 27s (11.85% Gen, 85.76% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 1m 52s. Estimated total time: 7h 31m 52s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 11s, 500 more iterations: 3h 45m 56s. [2026-03-25 18:05:04,595][__main__][INFO] - Starting iteration 322. [2026-03-25 18:05:04,598][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:05:04,599][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:05:07,812][__main__][INFO] - Number of regex retries in iteration 322: 0 [2026-03-25 18:05:07,813][__main__][INFO] - agents played in iteration 322 are Alice, Bob [2026-03-25 18:05:08,338][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:05:08,995][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:05:09,283][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:05:09,602][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:05:09,921][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:05:10,239][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:05:10,557][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:05:10,876][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:05:11,195][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:05:11,514][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:05:11,832][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:05:12,151][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:05:12,471][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:05:12,790][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:05:13,110][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:05:13,430][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:05:13,749][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:05:14,069][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:05:14,388][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:05:14,706][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:05:15,025][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:05:15,343][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:05:15,662][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:05:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:05:16,300][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:05:16,618][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:05:16,938][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:05:17,257][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:05:17,576][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:05:17,896][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:05:18,215][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:05:18,534][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:05:18,852][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:05:19,171][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:05:19,490][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:05:19,811][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:05:20,131][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:05:20,450][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:05:20,770][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:05:21,090][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:05:21,410][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:05:21,730][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:05:22,048][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:05:22,368][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:05:22,688][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:05:23,008][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:05:23,327][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:05:23,647][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:05:23,966][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:05:24,285][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:05:24,606][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:05:24,925][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:05:25,245][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:05:25,866][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:05:26,186][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:05:26,506][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:05:26,826][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:05:27,147][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:05:27,466][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:05:27,785][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:05:28,104][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:05:28,422][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:05:28,743][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:05:29,062][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:05:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:05:29,703][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:05:30,361][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:05:31,113][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:05:31,115][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:05:31,116][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:05:31,759][__main__][INFO] - Iteration 323 took 27s (11.83% Gen, 85.79% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 2m 14s. Estimated total time: 7h 32m 42s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 16s, 500 more iterations: 3h 46m 21s. [2026-03-25 18:05:31,761][__main__][INFO] - Starting iteration 323. [2026-03-25 18:05:31,764][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:05:31,765][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:05:35,002][__main__][INFO] - Number of regex retries in iteration 323: 0 [2026-03-25 18:05:35,003][__main__][INFO] - agents played in iteration 323 are Alice, Bob [2026-03-25 18:05:35,530][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:05:36,187][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:05:36,475][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:05:36,794][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:05:37,112][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:05:37,431][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:05:37,751][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:05:38,069][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:05:38,388][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:05:38,706][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:05:39,025][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:05:39,344][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:05:39,662][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:05:39,980][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:05:40,299][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:05:40,619][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:05:40,937][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:05:41,256][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:05:41,575][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:05:41,894][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:05:42,213][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:05:42,533][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:05:42,851][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:05:43,170][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:05:43,490][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:05:43,809][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:05:44,128][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:05:44,447][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:05:44,765][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:05:45,084][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:05:45,403][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:05:45,722][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:05:46,041][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:05:46,360][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:05:46,679][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:05:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:05:47,316][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:05:47,635][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:05:47,954][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:05:48,272][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:05:48,591][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:05:48,910][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:05:49,230][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:05:49,549][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:05:49,868][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:05:50,187][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:05:50,506][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:05:50,824][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:05:51,143][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:05:51,462][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:05:51,781][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:05:52,101][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:05:52,420][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:05:53,035][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:05:53,354][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:05:53,673][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:05:53,992][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:05:54,310][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:05:54,629][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:05:54,948][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:05:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:05:55,586][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:05:55,905][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:05:56,224][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:05:56,543][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:05:56,863][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:05:57,524][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:05:58,267][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:05:58,269][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:05:58,271][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:05:58,909][__main__][INFO] - Iteration 324 took 27s (11.93% Gen, 85.71% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 1m 31s. Estimated total time: 7h 32m 26s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 14s, 500 more iterations: 3h 46m 13s. [2026-03-25 18:05:58,912][__main__][INFO] - Starting iteration 324. [2026-03-25 18:05:58,915][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:05:58,915][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:06:02,107][__main__][INFO] - Number of regex retries in iteration 324: 0 [2026-03-25 18:06:02,108][__main__][INFO] - agents played in iteration 324 are Alice, Bob [2026-03-25 18:06:02,642][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:06:03,300][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:06:03,589][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:06:03,909][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:06:04,228][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:06:04,547][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:06:04,865][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:06:05,183][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:06:05,503][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:06:05,821][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:06:06,139][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:06:06,457][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:06:06,775][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:06:07,094][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:06:07,413][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:06:07,732][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:06:08,051][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:06:08,370][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:06:08,689][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:06:09,008][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:06:09,327][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:06:09,646][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:06:09,965][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:06:10,283][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:06:10,602][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:06:10,920][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:06:11,239][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:06:11,559][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:06:11,878][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:06:12,197][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:06:12,516][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:06:12,836][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:06:13,155][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:06:13,474][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:06:13,793][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:06:14,111][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:06:14,430][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:06:14,750][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:06:15,069][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:06:15,388][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:06:15,709][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:06:16,028][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:06:16,346][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:06:16,665][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:06:16,985][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:06:17,305][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:06:17,624][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:06:17,943][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:06:18,262][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:06:18,581][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:06:18,900][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:06:19,218][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:06:19,537][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:06:20,151][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:06:20,471][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:06:20,790][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:06:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:06:21,429][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:06:21,747][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:06:22,066][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:06:22,385][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:06:22,704][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:06:23,022][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:06:23,340][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:06:23,659][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:06:23,978][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:06:24,638][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:06:25,368][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:06:25,370][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:06:25,372][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:06:26,063][__main__][INFO] - Iteration 325 took 27s (11.76% Gen, 85.69% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 1m 7s. Estimated total time: 7h 32m 29s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 14s, 500 more iterations: 3h 46m 14s. [2026-03-25 18:06:26,065][__main__][INFO] - Starting iteration 325. [2026-03-25 18:06:26,068][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:06:26,068][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:06:29,307][__main__][INFO] - Number of regex retries in iteration 325: 0 [2026-03-25 18:06:29,308][__main__][INFO] - agents played in iteration 325 are Alice, Bob [2026-03-25 18:06:29,837][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:06:30,495][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:06:30,785][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:06:31,104][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:06:31,423][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:06:31,741][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:06:32,059][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:06:32,378][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:06:32,697][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:06:33,015][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:06:33,334][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:06:33,652][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:06:33,971][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:06:34,290][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:06:34,609][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:06:34,929][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:06:35,249][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:06:35,568][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:06:35,888][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:06:36,208][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:06:36,528][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:06:36,850][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:06:37,170][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:06:37,490][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:06:37,809][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:06:38,130][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:06:38,450][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:06:38,769][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:06:39,090][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:06:39,410][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:06:39,729][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:06:40,050][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:06:40,371][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:06:40,691][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:06:41,011][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:06:41,331][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:06:41,651][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:06:41,971][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:06:42,291][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:06:42,611][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:06:42,930][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:06:43,249][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:06:43,569][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:06:43,890][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:06:44,211][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:06:44,531][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:06:44,850][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:06:45,169][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:06:45,488][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:06:45,808][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:06:46,126][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:06:46,445][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:06:46,763][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:06:47,382][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:06:47,701][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:06:48,020][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:06:48,339][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:06:48,659][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:06:48,978][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:06:49,296][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:06:49,615][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:06:49,935][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:06:50,254][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:06:50,572][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:06:50,892][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:06:51,212][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:06:51,875][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:06:52,610][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:06:52,612][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:06:52,614][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:06:53,258][__main__][INFO] - Iteration 326 took 27s (11.91% Gen, 85.72% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 1m 22s. Estimated total time: 7h 33m 10s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 35s. [2026-03-25 18:06:53,260][__main__][INFO] - Starting iteration 326. [2026-03-25 18:06:53,263][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:06:53,264][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:06:56,467][__main__][INFO] - Number of regex retries in iteration 326: 0 [2026-03-25 18:06:56,468][__main__][INFO] - agents played in iteration 326 are Alice, Bob [2026-03-25 18:06:57,005][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:06:57,665][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:06:57,954][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:06:58,274][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:06:58,594][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:06:58,912][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:06:59,232][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:06:59,550][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:06:59,870][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:07:00,189][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:07:00,508][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:07:00,828][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:07:01,146][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:07:01,466][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:07:01,784][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:07:02,102][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:07:02,420][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:07:02,739][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:07:03,057][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:07:03,376][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:07:03,694][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:07:04,012][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:07:04,331][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:07:04,649][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:07:04,968][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:07:05,286][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:07:05,604][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:07:05,924][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:07:06,242][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:07:06,561][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:07:06,880][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:07:07,199][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:07:07,517][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:07:07,836][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:07:08,156][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:07:08,475][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:07:08,795][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:07:09,113][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:07:09,431][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:07:09,749][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:07:10,069][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:07:10,388][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:07:10,707][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:07:11,026][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:07:11,345][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:07:11,664][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:07:11,982][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:07:12,300][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:07:12,619][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:07:12,937][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:07:13,256][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:07:13,574][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:07:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:07:14,509][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:07:14,829][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:07:15,148][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:07:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:07:15,785][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:07:16,103][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:07:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:07:16,741][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:07:17,059][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:07:17,377][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:07:17,695][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:07:18,014][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:07:18,333][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:07:18,996][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:07:19,739][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:07:19,741][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:07:19,743][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:07:20,388][__main__][INFO] - Iteration 327 took 27s (11.81% Gen, 85.80% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 59m 50s. Estimated total time: 7h 32m 6s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 12s, 500 more iterations: 3h 46m 3s. [2026-03-25 18:07:20,390][__main__][INFO] - Starting iteration 327. [2026-03-25 18:07:20,393][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:07:20,394][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:07:23,623][__main__][INFO] - Number of regex retries in iteration 327: 0 [2026-03-25 18:07:23,624][__main__][INFO] - agents played in iteration 327 are Alice, Bob [2026-03-25 18:07:24,156][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:07:24,815][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:07:25,104][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:07:25,425][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:07:25,745][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:07:26,064][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:07:26,383][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:07:26,702][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:07:27,020][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:07:27,339][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:07:27,657][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:07:27,976][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:07:28,295][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:07:28,613][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:07:28,932][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:07:29,251][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:07:29,569][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:07:29,888][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:07:30,206][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:07:30,525][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:07:30,845][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:07:31,164][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:07:31,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:07:31,802][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:07:32,121][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:07:32,440][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:07:32,758][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:07:33,078][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:07:33,397][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:07:33,717][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:07:34,035][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:07:34,354][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:07:34,673][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:07:34,992][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:07:35,311][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:07:35,630][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:07:35,950][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:07:36,268][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:07:36,588][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:07:36,906][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:07:37,225][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:07:37,544][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:07:37,863][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:07:38,182][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:07:38,501][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:07:38,820][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:07:39,139][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:07:39,459][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:07:39,777][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:07:40,095][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:07:40,414][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:07:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:07:41,051][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:07:41,667][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:07:41,986][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:07:42,305][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:07:42,623][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:07:42,942][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:07:43,261][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:07:43,580][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:07:43,899][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:07:44,217][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:07:44,536][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:07:44,855][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:07:45,174][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:07:45,492][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:07:46,154][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:07:46,889][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:07:46,891][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:07:46,893][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:07:47,548][__main__][INFO] - Iteration 328 took 27s (11.90% Gen, 85.69% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 59m 52s. Estimated total time: 7h 32m 35s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 17s. [2026-03-25 18:07:47,550][__main__][INFO] - Starting iteration 328. [2026-03-25 18:07:47,553][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:07:47,553][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:07:50,803][__main__][INFO] - Number of regex retries in iteration 328: 0 [2026-03-25 18:07:50,804][__main__][INFO] - agents played in iteration 328 are Alice, Bob [2026-03-25 18:07:51,351][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:07:52,019][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:07:52,310][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:07:52,630][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:07:52,950][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:07:53,270][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:07:53,589][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:07:53,908][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:07:54,228][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:07:54,548][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:07:54,868][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:07:55,188][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:07:55,508][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:07:55,828][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:07:56,148][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:07:56,467][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:07:56,785][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:07:57,105][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:07:57,425][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:07:57,746][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:07:58,065][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:07:58,384][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:07:58,703][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:07:59,021][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:07:59,340][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:07:59,658][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:07:59,978][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:08:00,297][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:08:00,616][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:08:00,936][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:08:01,254][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:08:01,573][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:08:01,891][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:08:02,210][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:08:02,529][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:08:02,849][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:08:03,167][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:08:03,486][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:08:03,806][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:08:04,124][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:08:04,443][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:08:04,762][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:08:05,081][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:08:05,400][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:08:05,719][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:08:06,037][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:08:06,355][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:08:06,674][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:08:06,993][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:08:07,311][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:08:07,631][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:08:07,956][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:08:08,275][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:08:08,897][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:08:09,217][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:08:09,536][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:08:09,855][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:08:10,174][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:08:10,492][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:08:10,811][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:08:11,129][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:08:11,449][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:08:11,769][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:08:12,089][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:08:12,408][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:08:12,727][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:08:13,389][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:08:14,125][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:08:14,127][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:08:14,128][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:08:14,792][__main__][INFO] - Iteration 329 took 27s (11.93% Gen, 85.62% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 0m 50s. Estimated total time: 7h 34m 0s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 24s, 500 more iterations: 3h 47m 0s. [2026-03-25 18:08:14,794][__main__][INFO] - Starting iteration 329. [2026-03-25 18:08:14,797][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:08:14,798][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:08:18,005][__main__][INFO] - Number of regex retries in iteration 329: 0 [2026-03-25 18:08:18,006][__main__][INFO] - agents played in iteration 329 are Alice, Bob [2026-03-25 18:08:18,542][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:08:19,200][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:08:19,489][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:08:19,809][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:08:20,128][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:08:20,447][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:08:20,765][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:08:21,083][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:08:21,402][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:08:21,720][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:08:22,038][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:08:22,356][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:08:22,675][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:08:22,993][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:08:23,311][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:08:23,630][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:08:23,949][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:08:24,267][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:08:24,586][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:08:24,904][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:08:25,222][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:08:25,541][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:08:25,860][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:08:26,179][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:08:26,498][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:08:26,816][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:08:27,135][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:08:27,453][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:08:27,772][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:08:28,090][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:08:28,410][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:08:28,729][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:08:29,048][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:08:29,366][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:08:29,684][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:08:30,004][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:08:30,322][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:08:30,641][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:08:30,960][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:08:31,279][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:08:31,597][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:08:31,916][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:08:32,235][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:08:32,553][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:08:32,872][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:08:33,191][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:08:33,511][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:08:33,830][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:08:34,150][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:08:34,470][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:08:34,788][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:08:35,107][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:08:35,426][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:08:36,043][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:08:36,363][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:08:36,681][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:08:36,999][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:08:37,320][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:08:37,639][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:08:37,957][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:08:38,276][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:08:38,595][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:08:38,914][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:08:39,233][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:08:39,552][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:08:39,873][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:08:40,535][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:08:41,276][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:08:41,278][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:08:41,280][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:08:41,963][__main__][INFO] - Iteration 330 took 27s (11.81% Gen, 85.67% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 59m 9s. Estimated total time: 7h 32m 46s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 16s, 500 more iterations: 3h 46m 23s. [2026-03-25 18:08:41,965][__main__][INFO] - Starting iteration 330. [2026-03-25 18:08:41,968][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:08:41,968][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:08:45,151][__main__][INFO] - Number of regex retries in iteration 330: 0 [2026-03-25 18:08:45,151][__main__][INFO] - agents played in iteration 330 are Alice, Bob [2026-03-25 18:08:45,677][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:08:46,338][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:08:46,626][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:08:46,945][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:08:47,265][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:08:47,584][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:08:47,902][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:08:48,221][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:08:48,539][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:08:48,859][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:08:49,178][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:08:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:08:49,817][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:08:50,136][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:08:50,455][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:08:50,774][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:08:51,093][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:08:51,413][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:08:51,733][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:08:52,052][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:08:52,371][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:08:52,691][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:08:53,010][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:08:53,330][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:08:53,649][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:08:53,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:08:54,288][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:08:54,606][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:08:54,925][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:08:55,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:08:55,563][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:08:55,882][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:08:56,200][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:08:56,518][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:08:56,837][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:08:57,156][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:08:57,474][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:08:57,793][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:08:58,113][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:08:58,431][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:08:58,750][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:08:59,070][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:08:59,389][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:08:59,709][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:09:00,028][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:09:00,346][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:09:00,666][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:09:00,985][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:09:01,304][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:09:01,622][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:09:01,940][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:09:02,259][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:09:02,579][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:09:03,195][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:09:03,514][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:09:03,833][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:09:04,153][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:09:04,472][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:09:04,791][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:09:05,111][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:09:05,431][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:09:05,752][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:09:06,072][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:09:06,393][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:09:06,713][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:09:07,032][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:09:07,706][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:09:08,450][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:09:08,453][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:09:08,454][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:09:09,137][__main__][INFO] - Iteration 331 took 27s (11.72% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 58m 45s. Estimated total time: 7h 32m 50s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 25s. [2026-03-25 18:09:09,139][__main__][INFO] - Starting iteration 331. [2026-03-25 18:09:09,142][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:09:09,143][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:09:11,236][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played . did not match regex: (|), retry 1/1 [2026-03-25 18:09:12,439][__main__][INFO] - Number of regex retries in iteration 331: 1 [2026-03-25 18:09:12,440][__main__][INFO] - agents played in iteration 331 are Alice, Bob [2026-03-25 18:09:12,970][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:09:13,627][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:09:13,917][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:09:14,236][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:09:14,556][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:09:14,875][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:09:15,193][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:09:15,511][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:09:15,830][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:09:16,148][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:09:16,466][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:09:16,786][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:09:17,105][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:09:17,424][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:09:17,743][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:09:18,062][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:09:18,381][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:09:18,701][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:09:19,019][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:09:19,339][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:09:19,658][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:09:19,977][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:09:20,296][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:09:20,614][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:09:20,933][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:09:21,252][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:09:21,572][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:09:21,891][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:09:22,211][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:09:22,531][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:09:22,849][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:09:23,169][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:09:23,487][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:09:23,806][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:09:24,124][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:09:24,443][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:09:24,761][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:09:25,081][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:09:25,400][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:09:25,718][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:09:26,036][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:09:26,355][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:09:26,674][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:09:26,992][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:09:27,312][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:09:27,631][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:09:27,951][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:09:28,270][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:09:28,589][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:09:28,908][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:09:29,227][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:09:29,546][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:09:29,864][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:09:30,480][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:09:30,798][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:09:31,117][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:09:31,435][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:09:31,754][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:09:32,072][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:09:32,392][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:09:32,712][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:09:33,031][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:09:33,350][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:09:33,670][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:09:33,989][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:09:34,310][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:09:34,971][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:09:35,717][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:09:35,720][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:09:35,721][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:09:36,405][__main__][INFO] - Iteration 332 took 27s (12.09% Gen, 85.40% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 59m 51s. Estimated total time: 7h 34m 23s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 26s, 500 more iterations: 3h 47m 11s. [2026-03-25 18:09:36,407][__main__][INFO] - Starting iteration 332. [2026-03-25 18:09:36,410][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:09:36,411][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:09:39,544][__main__][INFO] - Number of regex retries in iteration 332: 0 [2026-03-25 18:09:39,545][__main__][INFO] - agents played in iteration 332 are Alice, Bob [2026-03-25 18:09:40,077][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:09:40,733][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:09:41,022][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:09:41,342][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:09:41,661][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:09:41,979][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:09:42,297][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:09:42,616][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:09:42,935][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:09:43,254][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:09:43,573][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:09:43,892][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:09:44,211][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:09:44,531][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:09:44,852][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:09:45,171][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:09:45,491][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:09:45,811][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:09:46,130][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:09:46,450][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:09:46,769][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:09:47,088][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:09:47,407][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:09:47,726][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:09:48,044][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:09:48,363][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:09:48,682][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:09:49,002][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:09:49,320][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:09:49,639][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:09:49,957][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:09:50,276][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:09:50,595][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:09:50,913][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:09:51,232][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:09:51,551][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:09:51,870][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:09:52,189][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:09:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:09:52,827][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:09:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:09:53,465][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:09:53,783][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:09:54,102][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:09:54,420][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:09:54,739][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:09:55,057][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:09:55,376][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:09:55,695][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:09:56,015][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:09:56,334][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:09:56,652][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:09:56,972][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:09:57,586][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:09:57,905][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:09:58,224][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:09:58,545][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:09:58,863][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:09:59,182][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:09:59,501][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:09:59,820][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:10:00,138][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:10:00,459][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:10:00,779][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:10:01,098][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:10:01,417][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:10:02,077][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:10:02,824][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:10:02,826][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:10:02,828][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:10:03,468][__main__][INFO] - Iteration 333 took 27s (11.58% Gen, 86.05% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 55m 59s. Estimated total time: 7h 30m 58s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 5s, 500 more iterations: 3h 45m 29s. [2026-03-25 18:10:03,470][__main__][INFO] - Starting iteration 333. [2026-03-25 18:10:03,473][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:10:03,474][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:10:06,642][__main__][INFO] - Number of regex retries in iteration 333: 0 [2026-03-25 18:10:06,643][__main__][INFO] - agents played in iteration 333 are Alice, Bob [2026-03-25 18:10:07,179][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:10:07,839][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:10:08,129][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:10:08,449][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:10:08,768][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:10:09,087][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:10:09,405][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:10:09,724][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:10:10,043][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:10:10,362][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:10:10,680][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:10:11,001][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:10:11,319][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:10:11,638][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:10:11,957][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:10:12,275][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:10:12,593][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:10:12,912][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:10:13,232][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:10:13,550][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:10:13,869][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:10:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:10:14,507][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:10:14,827][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:10:15,145][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:10:15,464][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:10:15,784][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:10:16,103][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:10:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:10:16,741][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:10:17,061][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:10:17,380][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:10:17,700][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:10:18,019][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:10:18,337][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:10:18,656][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:10:18,976][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:10:19,296][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:10:19,616][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:10:19,937][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:10:20,257][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:10:20,577][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:10:20,897][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:10:21,217][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:10:21,537][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:10:21,857][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:10:22,177][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:10:22,498][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:10:22,818][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:10:23,138][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:10:23,458][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:10:23,779][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:10:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:10:24,713][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:10:25,033][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:10:25,352][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:10:25,673][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:10:25,993][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:10:26,312][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:10:26,631][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:10:26,950][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:10:27,269][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:10:27,588][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:10:27,908][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:10:28,226][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:10:28,544][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:10:29,205][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:10:29,958][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:10:29,961][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:10:29,962][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:10:30,662][__main__][INFO] - Iteration 334 took 27s (11.66% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 57m 43s. Estimated total time: 7h 33m 9s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 34s. [2026-03-25 18:10:30,664][__main__][INFO] - Starting iteration 334. [2026-03-25 18:10:30,667][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:10:30,667][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:10:33,899][__main__][INFO] - Number of regex retries in iteration 334: 0 [2026-03-25 18:10:33,900][__main__][INFO] - agents played in iteration 334 are Alice, Bob [2026-03-25 18:10:34,439][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:10:35,096][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:10:35,385][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:10:35,703][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:10:36,023][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:10:36,342][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:10:36,660][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:10:36,979][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:10:37,298][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:10:37,617][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:10:37,935][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:10:38,254][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:10:38,574][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:10:38,893][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:10:39,212][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:10:39,531][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:10:39,850][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:10:40,170][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:10:40,488][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:10:40,807][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:10:41,125][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:10:41,444][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:10:41,762][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:10:42,081][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:10:42,401][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:10:42,719][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:10:43,038][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:10:43,357][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:10:43,677][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:10:43,996][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:10:44,315][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:10:44,634][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:10:44,954][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:10:45,273][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:10:45,591][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:10:45,911][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:10:46,230][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:10:46,549][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:10:46,868][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:10:47,187][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:10:47,506][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:10:47,825][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:10:48,144][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:10:48,462][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:10:48,782][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:10:49,101][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:10:49,418][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:10:49,738][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:10:50,057][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:10:50,376][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:10:50,695][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:10:51,015][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:10:51,335][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:10:51,949][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:10:52,268][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:10:52,588][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:10:52,907][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:10:53,226][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:10:53,544][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:10:53,865][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:10:54,184][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:10:54,503][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:10:54,821][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:10:55,140][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:10:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:10:55,779][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:10:56,439][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:10:57,184][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:10:57,186][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:10:57,188][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:10:57,894][__main__][INFO] - Iteration 335 took 27s (11.87% Gen, 85.53% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 57m 55s. Estimated total time: 7h 33m 48s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 22s, 500 more iterations: 3h 46m 54s. [2026-03-25 18:10:57,896][__main__][INFO] - Starting iteration 335. [2026-03-25 18:10:57,900][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:10:57,900][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:11:01,071][__main__][INFO] - Number of regex retries in iteration 335: 0 [2026-03-25 18:11:01,072][__main__][INFO] - agents played in iteration 335 are Alice, Bob [2026-03-25 18:11:01,606][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:11:02,264][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:11:02,553][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:11:02,873][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:11:03,192][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:11:03,511][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:11:03,830][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:11:04,148][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:11:04,468][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:11:04,787][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:11:05,106][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:11:05,425][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:11:05,745][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:11:06,064][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:11:06,385][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:11:06,704][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:11:07,024][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:11:07,343][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:11:07,662][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:11:07,981][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:11:08,300][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:11:08,619][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:11:08,938][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:11:09,257][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:11:09,577][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:11:09,896][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:11:10,215][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:11:10,533][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:11:10,853][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:11:11,173][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:11:11,492][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:11:11,812][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:11:12,132][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:11:12,451][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:11:12,769][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:11:13,088][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:11:13,408][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:11:13,727][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:11:14,046][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:11:14,365][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:11:14,685][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:11:15,005][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:11:15,325][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:11:15,644][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:11:15,964][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:11:16,282][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:11:16,601][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:11:16,920][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:11:17,239][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:11:17,558][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:11:17,877][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:11:18,196][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:11:18,515][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:11:19,130][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:11:19,449][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:11:19,768][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:11:20,087][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:11:20,405][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:11:20,724][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:11:21,042][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:11:21,361][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:11:21,679][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:11:21,998][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:11:22,317][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:11:22,637][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:11:22,956][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:11:23,623][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:11:24,366][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:11:24,368][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:11:24,370][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:11:25,021][__main__][INFO] - Iteration 336 took 27s (11.69% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 55m 41s. Estimated total time: 7h 32m 2s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 12s, 500 more iterations: 3h 46m 1s. [2026-03-25 18:11:25,023][__main__][INFO] - Starting iteration 336. [2026-03-25 18:11:25,026][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:11:25,027][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:11:28,243][__main__][INFO] - Number of regex retries in iteration 336: 0 [2026-03-25 18:11:28,244][__main__][INFO] - agents played in iteration 336 are Alice, Bob [2026-03-25 18:11:28,778][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:11:29,436][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:11:29,725][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:11:30,044][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:11:30,363][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:11:30,683][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:11:31,002][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:11:31,322][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:11:31,643][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:11:31,963][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:11:32,282][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:11:32,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:11:32,920][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:11:33,240][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:11:33,560][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:11:33,880][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:11:34,200][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:11:34,520][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:11:34,840][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:11:35,160][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:11:35,479][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:11:35,799][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:11:36,119][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:11:36,439][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:11:36,759][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:11:37,078][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:11:37,398][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:11:37,718][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:11:38,036][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:11:38,356][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:11:38,675][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:11:38,994][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:11:39,316][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:11:39,636][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:11:39,956][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:11:40,275][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:11:40,594][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:11:40,912][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:11:41,232][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:11:41,551][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:11:41,870][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:11:42,190][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:11:42,509][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:11:42,828][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:11:43,148][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:11:43,466][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:11:43,784][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:11:44,104][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:11:44,423][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:11:44,742][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:11:45,060][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:11:45,379][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:11:45,698][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:11:46,313][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:11:46,633][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:11:46,952][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:11:47,271][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:11:47,590][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:11:47,909][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:11:48,229][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:11:48,548][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:11:48,867][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:11:49,186][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:11:49,504][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:11:49,823][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:11:50,142][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:11:50,803][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:11:51,539][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:11:51,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:11:51,543][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:11:52,191][__main__][INFO] - Iteration 337 took 27s (11.84% Gen, 85.76% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 55m 58s. Estimated total time: 7h 32m 46s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 16s, 500 more iterations: 3h 46m 23s. [2026-03-25 18:11:52,194][__main__][INFO] - Starting iteration 337. [2026-03-25 18:11:52,197][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:11:52,197][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:11:55,390][__main__][INFO] - Number of regex retries in iteration 337: 0 [2026-03-25 18:11:55,391][__main__][INFO] - agents played in iteration 337 are Alice, Bob [2026-03-25 18:11:55,920][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:11:56,580][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:11:56,869][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:11:57,189][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:11:57,508][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:11:57,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:11:58,146][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:11:58,466][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:11:58,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:11:59,103][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:11:59,422][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:11:59,741][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:12:00,060][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:12:00,379][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:12:00,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:12:01,016][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:12:01,334][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:12:01,653][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:12:01,971][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:12:02,291][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:12:02,611][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:12:02,931][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:12:03,249][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:12:03,569][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:12:03,888][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:12:04,207][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:12:04,527][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:12:04,846][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:12:05,166][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:12:05,485][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:12:05,805][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:12:06,123][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:12:06,442][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:12:06,761][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:12:07,080][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:12:07,398][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:12:07,716][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:12:08,036][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:12:08,354][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:12:08,674][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:12:08,993][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:12:09,312][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:12:09,631][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:12:09,950][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:12:10,269][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:12:10,588][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:12:10,906][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:12:11,226][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:12:11,545][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:12:11,864][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:12:12,182][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:12:12,501][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:12:12,820][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:12:13,438][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:12:13,757][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:12:14,076][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:12:14,395][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:12:14,714][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:12:15,033][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:12:15,351][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:12:15,670][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:12:15,991][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:12:16,311][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:12:16,630][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:12:16,949][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:12:17,268][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:12:17,933][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:12:18,674][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:12:18,676][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:12:18,678][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:12:19,329][__main__][INFO] - Iteration 338 took 27s (11.77% Gen, 85.82% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 54m 58s. Estimated total time: 7h 32m 13s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 6s. [2026-03-25 18:12:19,332][__main__][INFO] - Starting iteration 338. [2026-03-25 18:12:19,334][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:12:19,335][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:12:22,516][__main__][INFO] - Number of regex retries in iteration 338: 0 [2026-03-25 18:12:22,517][__main__][INFO] - agents played in iteration 338 are Alice, Bob [2026-03-25 18:12:23,043][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:12:23,715][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:12:24,004][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:12:24,324][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:12:24,643][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:12:24,962][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:12:25,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:12:25,600][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:12:25,918][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:12:26,238][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:12:26,557][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:12:26,875][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:12:27,195][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:12:27,514][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:12:27,833][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:12:28,152][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:12:28,472][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:12:28,791][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:12:29,111][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:12:29,430][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:12:29,749][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:12:30,068][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:12:30,387][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:12:30,707][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:12:31,025][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:12:31,344][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:12:31,663][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:12:31,982][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:12:32,301][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:12:32,620][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:12:32,940][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:12:33,259][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:12:33,578][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:12:33,898][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:12:34,217][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:12:34,536][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:12:34,855][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:12:35,174][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:12:35,494][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:12:35,814][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:12:36,133][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:12:36,453][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:12:36,773][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:12:37,092][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:12:37,411][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:12:37,730][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:12:38,049][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:12:38,369][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:12:38,689][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:12:39,008][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:12:39,327][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:12:39,646][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:12:39,966][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:12:40,587][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:12:40,906][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:12:41,224][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:12:41,543][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:12:41,862][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:12:42,182][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:12:42,501][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:12:42,820][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:12:43,139][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:12:43,459][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:12:43,778][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:12:44,096][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:12:44,416][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:12:45,090][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:12:45,840][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:12:45,842][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:12:45,844][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:12:46,580][__main__][INFO] - Iteration 339 took 27s (11.68% Gen, 85.62% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 56m 24s. Estimated total time: 7h 34m 6s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 24s, 500 more iterations: 3h 47m 3s. [2026-03-25 18:12:46,582][__main__][INFO] - Starting iteration 339. [2026-03-25 18:12:46,585][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:12:46,586][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:12:49,774][__main__][INFO] - Number of regex retries in iteration 339: 0 [2026-03-25 18:12:49,775][__main__][INFO] - agents played in iteration 339 are Alice, Bob [2026-03-25 18:12:50,311][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:12:50,982][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:12:51,272][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:12:51,592][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:12:51,910][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:12:52,229][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:12:52,548][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:12:52,868][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:12:53,188][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:12:53,507][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:12:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:12:54,146][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:12:54,464][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:12:54,782][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:12:55,101][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:12:55,420][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:12:55,738][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:12:56,056][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:12:56,375][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:12:56,694][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:12:57,013][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:12:57,332][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:12:57,651][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:12:57,970][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:12:58,289][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:12:58,607][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:12:58,925][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:12:59,244][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:12:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:12:59,882][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:13:00,201][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:13:00,520][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:13:00,839][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:13:01,158][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:13:01,477][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:13:01,796][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:13:02,115][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:13:02,433][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:13:02,752][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:13:03,071][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:13:03,391][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:13:03,712][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:13:04,031][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:13:04,350][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:13:04,669][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:13:04,987][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:13:05,306][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:13:05,625][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:13:05,944][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:13:06,262][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:13:06,581][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:13:06,900][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:13:07,219][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:13:07,837][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:13:08,155][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:13:08,474][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:13:08,793][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:13:09,112][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:13:09,432][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:13:09,751][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:13:10,070][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:13:10,389][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:13:10,708][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:13:11,027][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:13:11,346][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:13:11,664][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:13:12,328][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:13:13,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:13:13,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:13:13,068][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:13:13,717][__main__][INFO] - Iteration 340 took 27s (11.75% Gen, 85.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 54m 3s. Estimated total time: 7h 32m 12s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 6s. [2026-03-25 18:13:13,719][__main__][INFO] - Starting iteration 340. [2026-03-25 18:13:13,722][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:13:13,722][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:13:16,898][__main__][INFO] - Number of regex retries in iteration 340: 0 [2026-03-25 18:13:16,899][__main__][INFO] - agents played in iteration 340 are Alice, Bob [2026-03-25 18:13:17,428][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:13:18,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:13:18,376][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:13:18,695][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:13:19,013][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:13:19,331][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:13:19,650][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:13:19,969][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:13:20,288][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:13:20,606][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:13:20,924][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:13:21,242][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:13:21,561][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:13:21,880][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:13:22,199][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:13:22,517][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:13:22,836][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:13:23,157][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:13:23,475][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:13:23,794][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:13:24,112][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:13:24,431][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:13:24,751][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:13:25,071][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:13:25,390][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:13:25,710][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:13:26,030][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:13:26,348][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:13:26,667][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:13:26,987][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:13:27,305][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:13:27,624][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:13:27,942][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:13:28,261][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:13:28,579][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:13:28,897][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:13:29,216][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:13:29,535][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:13:29,854][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:13:30,173][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:13:30,492][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:13:30,811][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:13:31,130][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:13:31,449][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:13:31,769][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:13:32,087][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:13:32,406][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:13:32,725][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:13:33,044][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:13:33,363][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:13:33,682][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:13:34,000][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:13:34,318][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:13:34,935][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:13:35,254][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:13:35,573][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:13:35,892][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:13:36,211][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:13:36,530][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:13:36,849][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:13:37,168][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:13:37,487][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:13:37,806][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:13:38,125][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:13:38,443][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:13:38,762][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:13:39,427][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:13:40,174][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:13:40,176][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:13:40,178][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:13:40,829][__main__][INFO] - Iteration 341 took 27s (11.72% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 53m 12s. Estimated total time: 7h 31m 48s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 10s, 500 more iterations: 3h 45m 54s. [2026-03-25 18:13:40,831][__main__][INFO] - Starting iteration 341. [2026-03-25 18:13:40,835][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:13:40,836][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:13:43,988][__main__][INFO] - Number of regex retries in iteration 341: 0 [2026-03-25 18:13:43,989][__main__][INFO] - agents played in iteration 341 are Alice, Bob [2026-03-25 18:13:44,518][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:13:45,178][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:13:45,468][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:13:45,787][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:13:46,105][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:13:46,425][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:13:46,744][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:13:47,062][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:13:47,381][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:13:47,701][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:13:48,019][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:13:48,337][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:13:48,656][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:13:48,974][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:13:49,293][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:13:49,612][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:13:49,932][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:13:50,251][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:13:50,570][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:13:50,888][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:13:51,207][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:13:51,526][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:13:51,843][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:13:52,162][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:13:52,480][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:13:52,798][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:13:53,117][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:13:53,437][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:13:53,755][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:13:54,073][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:13:54,392][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:13:54,710][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:13:55,029][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:13:55,349][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:13:55,667][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:13:55,986][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:13:56,304][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:13:56,623][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:13:56,941][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:13:57,261][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:13:57,580][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:13:57,899][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:13:58,218][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:13:58,537][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:13:58,856][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:13:59,175][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:13:59,495][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:13:59,815][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:14:00,134][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:14:00,454][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:14:00,774][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:14:01,094][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:14:01,412][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:14:02,033][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:14:02,353][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:14:02,673][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:14:02,993][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:14:03,313][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:14:03,632][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:14:03,952][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:14:04,272][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:14:04,592][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:14:04,912][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:14:05,232][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:14:05,551][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:14:05,871][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:14:06,537][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:14:07,269][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:14:07,271][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:14:07,273][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:14:07,923][__main__][INFO] - Iteration 342 took 27s (11.64% Gen, 85.95% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 52m 25s. Estimated total time: 7h 31m 29s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 8s, 500 more iterations: 3h 45m 44s. [2026-03-25 18:14:07,925][__main__][INFO] - Starting iteration 342. [2026-03-25 18:14:07,928][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:14:07,928][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:14:11,132][__main__][INFO] - Number of regex retries in iteration 342: 0 [2026-03-25 18:14:11,133][__main__][INFO] - agents played in iteration 342 are Alice, Bob [2026-03-25 18:14:11,665][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:14:12,325][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:14:12,615][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:14:12,934][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:14:13,253][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:14:13,573][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:14:13,892][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:14:14,211][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:14:14,530][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:14:14,850][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:14:15,168][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:14:15,486][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:14:15,804][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:14:16,122][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:14:16,442][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:14:16,761][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:14:17,079][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:14:17,398][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:14:17,718][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:14:18,036][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:14:18,355][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:14:18,674][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:14:18,994][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:14:19,313][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:14:19,632][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:14:19,951][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:14:20,270][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:14:20,590][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:14:20,910][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:14:21,229][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:14:21,547][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:14:21,865][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:14:22,184][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:14:22,503][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:14:22,821][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:14:23,139][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:14:23,458][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:14:23,778][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:14:24,097][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:14:24,415][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:14:24,733][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:14:25,052][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:14:25,371][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:14:25,690][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:14:26,009][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:14:26,327][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:14:26,645][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:14:26,964][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:14:27,283][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:14:27,601][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:14:27,919][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:14:28,238][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:14:28,556][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:14:29,179][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:14:29,498][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:14:29,818][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:14:30,138][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:14:30,457][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:14:30,776][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:14:31,095][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:14:31,414][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:14:31,733][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:14:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:14:32,372][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:14:32,692][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:14:33,012][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:14:33,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:14:34,434][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:14:34,436][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:14:34,437][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:14:35,145][__main__][INFO] - Iteration 343 took 27s (11.77% Gen, 85.62% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 54m 7s. Estimated total time: 7h 33m 37s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 48s. [2026-03-25 18:14:35,147][__main__][INFO] - Starting iteration 343. [2026-03-25 18:14:35,150][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:14:35,150][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:14:38,337][__main__][INFO] - Number of regex retries in iteration 343: 0 [2026-03-25 18:14:38,337][__main__][INFO] - agents played in iteration 343 are Alice, Bob [2026-03-25 18:14:38,865][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:14:39,538][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:14:39,827][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:14:40,146][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:14:40,464][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:14:40,782][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:14:41,100][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:14:41,418][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:14:41,737][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:14:42,057][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:14:42,376][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:14:42,694][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:14:43,013][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:14:43,331][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:14:43,651][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:14:43,970][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:14:44,290][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:14:44,610][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:14:44,929][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:14:45,247][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:14:45,566][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:14:45,885][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:14:46,203][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:14:46,522][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:14:46,842][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:14:47,160][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:14:47,479][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:14:47,797][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:14:48,116][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:14:48,437][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:14:48,756][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:14:49,075][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:14:49,395][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:14:49,714][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:14:50,033][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:14:50,353][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:14:50,673][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:14:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:14:51,313][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:14:51,632][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:14:51,952][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:14:52,271][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:14:52,592][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:14:52,911][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:14:53,230][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:14:53,548][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:14:53,867][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:14:54,188][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:14:54,506][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:14:54,825][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:14:55,143][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:14:55,462][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:14:55,781][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:14:56,404][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:14:56,724][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:14:57,043][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:14:57,362][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:14:57,681][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:14:58,000][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:14:58,319][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:14:58,637][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:14:58,955][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:14:59,274][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:14:59,593][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:14:59,912][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:15:00,232][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:15:00,909][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:15:01,657][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:15:01,659][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:15:01,661][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:15:02,363][__main__][INFO] - Iteration 344 took 27s (11.71% Gen, 85.70% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 53m 36s. Estimated total time: 7h 33m 34s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 21s, 500 more iterations: 3h 46m 47s. [2026-03-25 18:15:02,365][__main__][INFO] - Starting iteration 344. [2026-03-25 18:15:02,369][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:15:02,369][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:15:05,573][__main__][INFO] - Number of regex retries in iteration 344: 0 [2026-03-25 18:15:05,574][__main__][INFO] - agents played in iteration 344 are Alice, Bob [2026-03-25 18:15:06,103][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:15:06,777][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:15:07,066][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:15:07,386][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:15:07,705][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:15:08,025][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:15:08,344][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:15:08,662][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:15:08,981][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:15:09,300][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:15:09,619][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:15:09,937][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:15:10,255][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:15:10,575][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:15:10,893][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:15:11,212][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:15:11,531][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:15:11,850][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:15:12,169][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:15:12,489][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:15:12,807][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:15:13,126][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:15:13,445][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:15:13,764][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:15:14,084][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:15:14,405][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:15:14,725][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:15:15,045][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:15:15,364][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:15:15,683][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:15:16,003][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:15:16,323][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:15:16,644][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:15:16,964][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:15:17,285][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:15:17,604][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:15:17,924][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:15:18,244][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:15:18,563][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:15:18,884][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:15:19,204][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:15:19,524][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:15:19,844][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:15:20,164][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:15:20,483][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:15:20,803][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:15:21,123][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:15:21,444][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:15:21,764][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:15:22,084][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:15:22,403][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:15:22,723][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:15:23,041][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:15:23,665][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:15:23,984][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:15:24,303][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:15:24,622][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:15:24,941][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:15:25,260][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:15:25,579][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:15:25,897][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:15:26,216][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:15:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:15:26,854][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:15:27,174][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:15:27,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:15:28,159][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:15:28,896][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:15:28,898][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:15:28,899][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:15:30,187][__main__][INFO] - Iteration 345 took 27s (11.52% Gen, 83.85% Train). Generation: 3s, Training: 23s. Estimated remaining time: 5h 3m 13s. Estimated total time: 7h 43m 39s. Time estimates for 10 more iterations: 4m 38s, 100 more iterations: 46m 21s, 500 more iterations: 3h 51m 49s. [2026-03-25 18:15:30,189][__main__][INFO] - Starting iteration 345. [2026-03-25 18:15:30,192][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:15:30,193][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:15:33,405][__main__][INFO] - Number of regex retries in iteration 345: 0 [2026-03-25 18:15:33,406][__main__][INFO] - agents played in iteration 345 are Alice, Bob [2026-03-25 18:15:33,937][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:15:34,610][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:15:34,899][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:15:35,218][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:15:35,537][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:15:35,857][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:15:36,176][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:15:36,494][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:15:36,813][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:15:37,132][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:15:37,451][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:15:37,770][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:15:38,088][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:15:38,407][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:15:38,726][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:15:39,044][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:15:39,363][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:15:39,682][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:15:40,002][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:15:40,321][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:15:40,639][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:15:40,958][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:15:41,277][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:15:41,596][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:15:41,914][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:15:42,233][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:15:42,551][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:15:42,870][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:15:43,189][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:15:43,509][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:15:43,828][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:15:44,148][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:15:44,466][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:15:44,786][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:15:45,105][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:15:45,424][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:15:45,743][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:15:46,061][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:15:46,380][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:15:46,699][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:15:47,018][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:15:47,338][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:15:47,657][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:15:47,977][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:15:48,296][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:15:48,615][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:15:48,934][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:15:49,253][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:15:49,572][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:15:49,891][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:15:50,210][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:15:50,529][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:15:50,847][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:15:51,462][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:15:51,782][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:15:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:15:52,420][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:15:52,739][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:15:53,057][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:15:53,377][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:15:53,695][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:15:54,014][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:15:54,334][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:15:54,653][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:15:54,973][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:15:55,291][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:15:55,954][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:15:56,691][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:15:56,693][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:15:56,695][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:15:57,343][__main__][INFO] - Iteration 346 took 27s (11.83% Gen, 85.77% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 51m 38s. Estimated total time: 7h 32m 31s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 15s. [2026-03-25 18:15:57,345][__main__][INFO] - Starting iteration 346. [2026-03-25 18:15:57,348][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:15:57,349][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:16:00,553][__main__][INFO] - Number of regex retries in iteration 346: 0 [2026-03-25 18:16:00,553][__main__][INFO] - agents played in iteration 346 are Alice, Bob [2026-03-25 18:16:01,085][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:16:01,746][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:16:02,035][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:16:02,354][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:16:02,673][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:16:02,991][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:16:03,310][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:16:03,628][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:16:03,946][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:16:04,265][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:16:04,583][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:16:04,904][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:16:05,222][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:16:05,542][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:16:05,861][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:16:06,180][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:16:06,498][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:16:06,819][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:16:07,138][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:16:07,457][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:16:07,776][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:16:08,095][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:16:08,414][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:16:08,733][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:16:09,052][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:16:09,371][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:16:09,690][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:16:10,009][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:16:10,328][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:16:10,647][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:16:10,965][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:16:11,284][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:16:11,603][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:16:11,922][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:16:12,242][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:16:12,561][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:16:12,879][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:16:13,199][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:16:13,517][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:16:13,836][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:16:14,155][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:16:14,475][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:16:14,793][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:16:15,112][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:16:15,431][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:16:15,751][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:16:16,070][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:16:16,389][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:16:16,708][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:16:17,027][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:16:17,346][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:16:17,664][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:16:17,983][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:16:18,599][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:16:18,918][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:16:19,239][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:16:19,558][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:16:19,877][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:16:20,196][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:16:20,515][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:16:20,834][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:16:21,153][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:16:21,472][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:16:21,790][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:16:22,109][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:16:22,429][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:16:23,094][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:16:23,848][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:16:23,851][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:16:23,852][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:16:24,567][__main__][INFO] - Iteration 347 took 27s (11.77% Gen, 85.59% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 52m 19s. Estimated total time: 7h 33m 40s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 22s, 500 more iterations: 3h 46m 50s. [2026-03-25 18:16:24,570][__main__][INFO] - Starting iteration 347. [2026-03-25 18:16:24,573][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:16:24,573][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:16:27,799][__main__][INFO] - Number of regex retries in iteration 347: 0 [2026-03-25 18:16:27,799][__main__][INFO] - agents played in iteration 347 are Alice, Bob [2026-03-25 18:16:28,361][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:16:29,033][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:16:29,322][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:16:29,642][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:16:29,961][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:16:30,281][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:16:30,601][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:16:30,921][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:16:31,240][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:16:31,560][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:16:31,879][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:16:32,198][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:16:32,518][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:16:32,839][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:16:33,158][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:16:33,478][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:16:33,798][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:16:34,117][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:16:34,436][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:16:34,755][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:16:35,074][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:16:35,393][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:16:35,714][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:16:36,035][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:16:36,353][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:16:36,672][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:16:36,991][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:16:37,310][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:16:37,628][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:16:37,947][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:16:38,265][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:16:38,584][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:16:38,903][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:16:39,222][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:16:39,540][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:16:39,859][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:16:40,178][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:16:40,497][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:16:40,816][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:16:41,135][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:16:41,453][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:16:41,773][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:16:42,092][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:16:42,411][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:16:42,730][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:16:43,049][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:16:43,367][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:16:43,685][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:16:44,004][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:16:44,323][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:16:44,642][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:16:44,960][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:16:45,279][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:16:45,897][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:16:46,216][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:16:46,535][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:16:46,854][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:16:47,172][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:16:47,491][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:16:47,812][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:16:48,132][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:16:48,451][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:16:48,769][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:16:49,088][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:16:49,407][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:16:49,726][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:16:50,390][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:16:51,147][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:16:51,149][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:16:51,151][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:16:51,859][__main__][INFO] - Iteration 348 took 27s (11.82% Gen, 85.58% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 52m 59s. Estimated total time: 7h 34m 47s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 23s. [2026-03-25 18:16:51,861][__main__][INFO] - Starting iteration 348. [2026-03-25 18:16:51,864][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:16:51,864][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:16:55,043][__main__][INFO] - Number of regex retries in iteration 348: 0 [2026-03-25 18:16:55,044][__main__][INFO] - agents played in iteration 348 are Alice, Bob [2026-03-25 18:16:55,578][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:16:56,238][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:16:56,527][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:16:56,847][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:16:57,167][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:16:57,487][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:16:57,805][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:16:58,124][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:16:58,443][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:16:58,761][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:16:59,080][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:16:59,398][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:16:59,717][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:17:00,036][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:17:00,356][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:17:00,676][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:17:00,995][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:17:01,313][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:17:01,632][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:17:01,952][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:17:02,273][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:17:02,592][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:17:02,912][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:17:03,231][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:17:03,551][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:17:03,870][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:17:04,188][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:17:04,507][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:17:04,826][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:17:05,146][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:17:05,467][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:17:05,787][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:17:06,106][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:17:06,425][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:17:06,743][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:17:07,062][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:17:07,381][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:17:07,701][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:17:08,020][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:17:08,338][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:17:08,657][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:17:08,977][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:17:09,296][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:17:09,616][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:17:09,935][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:17:10,254][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:17:10,573][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:17:10,893][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:17:11,212][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:17:11,532][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:17:11,851][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:17:12,169][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:17:12,489][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:17:13,107][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:17:13,426][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:17:13,746][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:17:14,065][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:17:14,385][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:17:14,704][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:17:15,024][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:17:15,343][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:17:15,662][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:17:15,981][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:17:16,300][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:17:16,619][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:17:16,939][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:17:17,604][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:17:18,353][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:17:18,355][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:17:18,357][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:17:19,058][__main__][INFO] - Iteration 349 took 27s (11.69% Gen, 85.72% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 51m 0s. Estimated total time: 7h 33m 15s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 37s. [2026-03-25 18:17:19,060][__main__][INFO] - Starting iteration 349. [2026-03-25 18:17:19,063][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:17:19,064][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:17:22,239][__main__][INFO] - Number of regex retries in iteration 349: 0 [2026-03-25 18:17:22,240][__main__][INFO] - agents played in iteration 349 are Alice, Bob [2026-03-25 18:17:22,773][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:17:23,435][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:17:23,724][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:17:24,044][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:17:24,362][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:17:24,681][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:17:25,000][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:17:25,318][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:17:25,636][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:17:25,955][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:17:26,273][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:17:26,592][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:17:26,911][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:17:27,231][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:17:27,550][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:17:27,871][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:17:28,190][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:17:28,509][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:17:28,828][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:17:29,148][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:17:29,468][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:17:29,787][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:17:30,106][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:17:30,425][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:17:30,743][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:17:31,062][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:17:31,380][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:17:31,698][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:17:32,017][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:17:32,336][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:17:32,654][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:17:32,973][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:17:33,291][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:17:33,610][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:17:33,929][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:17:34,247][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:17:34,566][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:17:34,885][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:17:35,203][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:17:35,522][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:17:35,841][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:17:36,159][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:17:36,478][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:17:36,797][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:17:37,116][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:17:37,434][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:17:37,752][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:17:38,073][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:17:38,392][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:17:38,711][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:17:39,031][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:17:39,350][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:17:39,669][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:17:40,286][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:17:40,605][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:17:40,924][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:17:41,243][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:17:41,561][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:17:41,881][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:17:42,199][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:17:42,518][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:17:42,837][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:17:43,158][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:17:43,477][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:17:43,797][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:17:44,117][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:17:44,792][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:17:45,539][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:17:45,542][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:17:45,543][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:17:46,266][__main__][INFO] - Iteration 350 took 27s (11.67% Gen, 85.66% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 50m 42s. Estimated total time: 7h 33m 24s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 20s, 500 more iterations: 3h 46m 42s. [2026-03-25 18:17:46,269][__main__][INFO] - Starting iteration 350. [2026-03-25 18:17:46,271][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 6 and human policies 1. [2026-03-25 18:17:46,272][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:17:49,512][__main__][INFO] - Number of regex retries in iteration 350: 0 [2026-03-25 18:17:49,513][__main__][INFO] - agents played in iteration 350 are Alice, Bob [2026-03-25 18:17:50,060][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:17:50,721][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:17:51,010][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:17:51,330][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:17:51,648][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:17:51,966][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:17:52,285][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:17:52,605][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:17:52,923][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:17:53,241][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:17:53,559][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:17:53,878][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:17:54,197][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:17:54,516][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:17:54,835][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:17:55,153][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:17:55,471][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:17:55,790][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:17:56,109][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:17:56,428][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:17:56,747][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:17:57,067][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:17:57,386][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:17:57,705][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:17:58,024][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:17:58,343][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:17:58,662][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:17:58,981][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:17:59,300][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:17:59,619][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:17:59,938][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:18:00,257][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:18:00,576][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:18:00,895][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:18:01,215][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:18:01,534][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:18:01,852][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:18:02,173][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:18:02,493][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:18:02,813][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:18:03,133][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:18:03,452][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:18:03,771][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:18:04,090][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:18:04,409][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:18:04,728][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:18:05,047][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:18:05,366][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:18:05,684][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:18:06,002][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:18:06,321][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:18:06,639][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:18:06,959][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:18:07,575][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:18:07,894][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:18:08,213][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:18:08,532][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:18:08,851][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:18:09,170][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:18:09,489][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:18:09,808][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:18:10,128][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:18:10,447][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:18:10,766][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:18:11,086][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:18:11,405][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:18:12,072][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:18:12,829][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:18:12,831][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:18:12,833][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:18:14,022][__main__][INFO] - Iteration 351 took 27s (11.68% Gen, 84.03% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 59m 22s. Estimated total time: 7h 42m 32s. Time estimates for 10 more iterations: 4m 37s, 100 more iterations: 46m 15s, 500 more iterations: 3h 51m 16s. [2026-03-25 18:18:14,025][__main__][INFO] - Starting iteration 351. [2026-03-25 18:18:14,028][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:18:14,028][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:18:17,248][__main__][INFO] - Number of regex retries in iteration 351: 0 [2026-03-25 18:18:17,249][__main__][INFO] - agents played in iteration 351 are Alice, Bob [2026-03-25 18:18:17,779][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:18:18,441][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:18:18,730][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:18:19,049][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:18:19,368][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:18:19,688][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:18:20,007][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:18:20,327][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:18:20,645][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:18:20,964][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:18:21,282][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:18:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:18:21,919][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:18:22,239][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:18:22,559][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:18:22,877][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:18:23,197][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:18:23,517][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:18:23,835][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:18:24,153][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:18:24,472][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:18:24,792][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:18:25,112][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:18:25,430][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:18:25,749][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:18:26,067][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:18:26,386][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:18:26,704][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:18:27,023][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:18:27,342][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:18:27,661][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:18:27,980][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:18:28,300][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:18:28,620][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:18:28,938][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:18:29,257][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:18:29,575][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:18:29,894][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:18:30,214][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:18:30,534][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:18:30,853][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:18:31,172][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:18:31,490][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:18:31,809][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:18:32,128][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:18:32,446][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:18:32,765][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:18:33,085][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:18:33,404][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:18:33,722][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:18:34,041][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:18:34,360][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:18:34,679][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:18:35,297][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:18:35,616][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:18:35,934][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:18:36,253][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:18:36,573][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:18:36,893][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:18:37,213][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:18:37,532][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:18:37,851][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:18:38,169][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:18:38,488][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:18:38,807][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:18:39,126][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:18:39,792][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:18:40,533][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:18:40,535][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:18:40,537][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:18:41,183][__main__][INFO] - Iteration 352 took 27s (11.86% Gen, 85.76% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 48m 59s. Estimated total time: 7h 32m 36s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 18s. [2026-03-25 18:18:41,185][__main__][INFO] - Starting iteration 352. [2026-03-25 18:18:41,188][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:18:41,189][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:18:44,407][__main__][INFO] - Number of regex retries in iteration 352: 0 [2026-03-25 18:18:44,407][__main__][INFO] - agents played in iteration 352 are Alice, Bob [2026-03-25 18:18:44,940][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:18:45,602][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:18:45,892][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:18:46,212][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:18:46,530][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:18:46,848][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:18:47,167][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:18:47,487][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:18:47,805][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:18:48,124][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:18:48,443][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:18:48,761][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:18:49,080][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:18:49,398][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:18:49,717][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:18:50,036][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:18:50,355][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:18:50,674][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:18:50,993][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:18:51,312][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:18:51,631][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:18:51,950][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:18:52,269][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:18:52,588][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:18:52,906][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:18:53,224][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:18:53,544][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:18:53,863][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:18:54,182][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:18:54,500][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:18:54,820][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:18:55,139][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:18:55,458][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:18:55,778][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:18:56,097][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:18:56,417][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:18:56,737][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:18:57,056][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:18:57,377][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:18:57,696][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:18:58,016][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:18:58,336][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:18:58,655][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:18:58,974][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:18:59,294][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:18:59,614][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:18:59,933][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:19:00,253][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:19:00,574][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:19:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:19:01,214][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:19:01,534][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:19:01,854][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:19:02,475][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:19:02,795][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:19:03,114][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:19:03,434][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:19:03,755][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:19:04,074][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:19:04,394][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:19:04,714][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:19:05,033][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:19:05,352][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:19:05,671][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:19:05,990][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:19:06,308][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:19:06,974][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:19:07,719][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:19:07,721][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:19:07,723][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:19:08,444][__main__][INFO] - Iteration 353 took 27s (11.81% Gen, 85.54% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 50m 12s. Estimated total time: 7h 34m 16s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 25s, 500 more iterations: 3h 47m 8s. [2026-03-25 18:19:08,446][__main__][INFO] - Starting iteration 353. [2026-03-25 18:19:08,449][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:19:08,450][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:19:11,637][__main__][INFO] - Number of regex retries in iteration 353: 0 [2026-03-25 18:19:11,638][__main__][INFO] - agents played in iteration 353 are Alice, Bob [2026-03-25 18:19:12,165][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:19:12,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:19:13,115][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:19:13,435][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:19:13,755][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:19:14,073][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:19:14,392][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:19:14,711][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:19:15,031][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:19:15,350][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:19:15,668][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:19:15,987][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:19:16,306][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:19:16,625][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:19:16,944][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:19:17,263][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:19:17,582][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:19:17,901][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:19:18,219][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:19:18,538][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:19:18,856][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:19:19,176][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:19:19,495][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:19:19,814][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:19:20,133][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:19:20,452][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:19:20,771][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:19:21,091][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:19:21,410][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:19:21,730][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:19:22,048][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:19:22,367][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:19:22,685][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:19:23,003][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:19:23,321][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:19:23,640][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:19:23,958][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:19:24,277][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:19:24,596][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:19:24,915][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:19:25,234][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:19:25,553][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:19:25,872][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:19:26,191][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:19:26,510][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:19:26,829][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:19:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:19:27,468][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:19:27,788][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:19:28,107][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:19:28,426][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:19:28,745][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:19:29,064][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:19:29,687][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:19:30,006][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:19:30,325][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:19:30,644][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:19:30,963][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:19:31,283][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:19:31,601][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:19:31,921][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:19:32,239][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:19:32,558][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:19:32,877][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:19:33,196][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:19:33,515][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:19:34,190][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:19:34,947][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:19:34,950][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:19:34,951][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:19:35,690][__main__][INFO] - Iteration 354 took 27s (11.70% Gen, 85.58% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 49m 31s. Estimated total time: 7h 34m 2s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 24s, 500 more iterations: 3h 47m 1s. [2026-03-25 18:19:35,693][__main__][INFO] - Starting iteration 354. [2026-03-25 18:19:35,696][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:19:35,696][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:19:38,888][__main__][INFO] - Number of regex retries in iteration 354: 0 [2026-03-25 18:19:38,889][__main__][INFO] - agents played in iteration 354 are Alice, Bob [2026-03-25 18:19:39,415][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:19:40,087][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:19:40,377][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:19:40,696][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:19:41,015][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:19:41,334][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:19:41,654][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:19:41,973][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:19:42,292][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:19:42,613][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:19:42,932][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:19:43,251][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:19:43,571][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:19:43,889][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:19:44,208][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:19:44,527][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:19:44,846][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:19:45,165][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:19:45,484][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:19:45,803][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:19:46,121][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:19:46,440][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:19:46,759][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:19:47,078][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:19:47,397][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:19:47,716][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:19:48,034][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:19:48,353][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:19:48,673][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:19:48,994][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:19:49,314][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:19:49,633][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:19:49,953][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:19:50,273][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:19:50,593][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:19:50,913][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:19:51,232][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:19:51,551][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:19:51,870][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:19:52,189][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:19:52,508][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:19:52,827][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:19:53,146][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:19:53,465][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:19:53,786][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:19:54,105][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:19:54,424][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:19:54,743][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:19:55,061][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:19:55,381][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:19:55,699][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:19:56,018][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:19:56,337][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:19:56,964][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:19:57,283][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:19:57,601][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:19:57,920][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:19:58,239][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:19:58,558][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:19:58,876][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:19:59,195][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:19:59,515][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:19:59,834][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:20:00,154][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:20:00,474][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:20:00,794][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:20:01,470][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:20:02,217][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:20:02,219][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:20:02,221][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:20:02,850][__main__][INFO] - Iteration 355 took 27s (11.76% Gen, 85.92% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 47m 37s. Estimated total time: 7h 32m 35s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 17s. [2026-03-25 18:20:02,853][__main__][INFO] - Starting iteration 355. [2026-03-25 18:20:02,855][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:20:02,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:20:06,092][__main__][INFO] - Number of regex retries in iteration 355: 0 [2026-03-25 18:20:06,093][__main__][INFO] - agents played in iteration 355 are Alice, Bob [2026-03-25 18:20:06,621][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:20:07,293][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:20:07,583][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:20:07,903][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:20:08,221][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:20:08,540][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:20:08,861][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:20:09,180][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:20:09,499][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:20:09,819][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:20:10,138][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:20:10,457][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:20:10,776][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:20:11,096][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:20:11,415][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:20:11,735][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:20:12,055][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:20:12,375][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:20:12,695][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:20:13,015][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:20:13,335][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:20:13,656][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:20:13,976][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:20:14,296][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:20:14,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:20:14,935][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:20:15,254][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:20:15,575][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:20:15,895][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:20:16,215][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:20:16,535][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:20:16,855][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:20:17,175][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:20:17,493][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:20:17,812][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:20:18,134][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:20:18,455][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:20:18,775][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:20:19,095][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:20:19,414][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:20:19,732][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:20:20,052][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:20:20,370][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:20:20,689][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:20:21,008][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:20:21,327][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:20:21,647][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:20:21,966][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:20:22,285][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:20:22,604][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:20:22,922][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:20:23,240][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:20:23,559][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:20:24,181][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:20:24,501][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:20:24,819][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:20:25,139][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:20:25,458][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:20:25,777][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:20:26,096][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:20:26,415][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:20:26,735][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:20:27,054][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:20:27,374][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:20:27,694][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:20:28,013][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:20:28,688][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:20:29,511][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:20:29,513][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:20:29,514][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:20:30,166][__main__][INFO] - Iteration 356 took 27s (11.85% Gen, 85.76% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 49m 45s. Estimated total time: 7h 35m 11s. Time estimates for 10 more iterations: 4m 33s, 100 more iterations: 45m 31s, 500 more iterations: 3h 47m 35s. [2026-03-25 18:20:30,168][__main__][INFO] - Starting iteration 356. [2026-03-25 18:20:30,171][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:20:30,171][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:20:33,372][__main__][INFO] - Number of regex retries in iteration 356: 0 [2026-03-25 18:20:33,373][__main__][INFO] - agents played in iteration 356 are Alice, Bob [2026-03-25 18:20:33,909][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:20:34,581][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:20:34,871][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:20:35,190][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:20:35,509][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:20:35,827][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:20:36,146][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:20:36,466][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:20:36,785][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:20:37,104][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:20:37,424][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:20:37,744][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:20:38,063][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:20:38,382][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:20:38,702][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:20:39,021][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:20:39,339][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:20:39,658][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:20:39,976][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:20:40,296][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:20:40,616][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:20:40,935][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:20:41,254][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:20:41,573][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:20:41,893][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:20:42,212][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:20:42,531][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:20:42,851][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:20:43,170][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:20:43,489][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:20:43,808][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:20:44,126][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:20:44,444][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:20:44,764][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:20:45,083][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:20:45,402][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:20:45,721][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:20:46,040][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:20:46,360][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:20:46,679][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:20:46,998][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:20:47,317][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:20:47,636][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:20:47,955][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:20:48,275][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:20:48,594][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:20:48,913][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:20:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:20:49,552][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:20:49,871][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:20:50,190][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:20:50,509][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:20:50,829][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:20:51,452][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:20:51,772][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:20:52,093][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:20:52,412][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:20:52,732][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:20:53,051][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:20:53,371][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:20:53,691][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:20:54,009][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:20:54,328][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:20:54,647][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:20:54,965][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:20:55,284][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:20:55,959][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:20:56,710][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:20:56,712][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:20:56,714][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:20:57,365][__main__][INFO] - Iteration 357 took 27s (11.77% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 47m 21s. Estimated total time: 7h 33m 14s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 19s, 500 more iterations: 3h 46m 37s. [2026-03-25 18:20:57,367][__main__][INFO] - Starting iteration 357. [2026-03-25 18:20:57,370][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:20:57,370][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:21:00,593][__main__][INFO] - Number of regex retries in iteration 357: 0 [2026-03-25 18:21:00,594][__main__][INFO] - agents played in iteration 357 are Alice, Bob [2026-03-25 18:21:01,118][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:21:01,771][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:21:02,060][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:21:02,380][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:21:02,700][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:21:03,020][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:21:03,338][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:21:03,657][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:21:03,976][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:21:04,295][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:21:04,614][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:21:04,934][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:21:05,253][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:21:05,572][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:21:05,891][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:21:06,210][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:21:06,530][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:21:06,848][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:21:07,167][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:21:07,486][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:21:07,805][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:21:08,125][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:21:08,444][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:21:08,762][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:21:09,081][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:21:09,400][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:21:09,719][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:21:10,039][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:21:10,358][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:21:10,677][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:21:10,996][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:21:11,315][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:21:11,634][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:21:11,953][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:21:12,273][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:21:12,592][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:21:12,912][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:21:13,231][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:21:13,551][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:21:13,870][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:21:14,189][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:21:14,508][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:21:14,827][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:21:15,147][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:21:15,466][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:21:15,785][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:21:16,103][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:21:16,422][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:21:16,741][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:21:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:21:17,378][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:21:17,697][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:21:18,016][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:21:18,628][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:21:18,947][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:21:19,265][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:21:19,585][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:21:19,905][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:21:20,224][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:21:20,543][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:21:20,863][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:21:21,182][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:21:21,500][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:21:21,818][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:21:22,138][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:21:22,457][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:21:23,111][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:21:23,875][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:21:23,877][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:21:23,879][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:21:24,541][__main__][INFO] - Iteration 358 took 27s (11.86% Gen, 85.69% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 46m 32s. Estimated total time: 7h 32m 52s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 26s. [2026-03-25 18:21:24,544][__main__][INFO] - Starting iteration 358. [2026-03-25 18:21:24,547][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:21:24,547][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:21:27,817][__main__][INFO] - Number of regex retries in iteration 358: 0 [2026-03-25 18:21:27,818][__main__][INFO] - agents played in iteration 358 are Alice, Bob [2026-03-25 18:21:28,348][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:21:29,014][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:21:29,304][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:21:29,623][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:21:29,943][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:21:30,263][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:21:30,582][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:21:30,901][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:21:31,220][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:21:31,540][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:21:31,861][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:21:32,181][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:21:32,500][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:21:32,820][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:21:33,139][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:21:33,458][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:21:33,776][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:21:34,095][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:21:34,414][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:21:34,733][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:21:35,052][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:21:35,371][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:21:35,691][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:21:36,009][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:21:36,328][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:21:36,646][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:21:36,964][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:21:37,282][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:21:37,601][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:21:37,920][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:21:38,238][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:21:38,557][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:21:38,877][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:21:39,195][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:21:39,513][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:21:39,833][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:21:40,152][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:21:40,472][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:21:40,791][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:21:41,111][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:21:41,429][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:21:41,748][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:21:42,067][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:21:42,386][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:21:42,705][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:21:43,024][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:21:43,343][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:21:43,662][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:21:43,981][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:21:44,300][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:21:44,620][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:21:44,940][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:21:45,259][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:21:45,872][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:21:46,191][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:21:46,510][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:21:46,829][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:21:47,149][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:21:47,468][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:21:47,787][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:21:48,107][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:21:48,425][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:21:48,744][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:21:49,063][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:21:49,382][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:21:49,701][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:21:50,357][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:21:51,065][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:21:51,067][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:21:51,069][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:21:51,719][__main__][INFO] - Iteration 359 took 27s (12.04% Gen, 85.57% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 46m 6s. Estimated total time: 7h 32m 53s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 26s. [2026-03-25 18:21:51,721][__main__][INFO] - Starting iteration 359. [2026-03-25 18:21:51,725][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:21:51,725][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:21:54,965][__main__][INFO] - Number of regex retries in iteration 359: 0 [2026-03-25 18:21:54,966][__main__][INFO] - agents played in iteration 359 are Alice, Bob [2026-03-25 18:21:55,497][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:21:56,150][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:21:56,438][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:21:56,758][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:21:57,078][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:21:57,397][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:21:57,716][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:21:58,035][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:21:58,354][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:21:58,675][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:21:58,995][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:21:59,315][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:21:59,634][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:21:59,953][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:22:00,273][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:22:00,593][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:22:00,913][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:22:01,231][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:22:01,551][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:22:01,870][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:22:02,189][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:22:02,507][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:22:02,826][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:22:03,145][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:22:03,463][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:22:03,783][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:22:04,102][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:22:04,421][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:22:04,741][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:22:05,059][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:22:05,377][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:22:05,696][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:22:06,016][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:22:06,334][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:22:06,653][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:22:06,972][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:22:07,291][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:22:07,610][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:22:07,928][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:22:08,247][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:22:08,566][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:22:08,885][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:22:09,204][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:22:09,524][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:22:09,842][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:22:10,161][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:22:10,481][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:22:10,800][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:22:11,118][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:22:11,437][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:22:11,755][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:22:12,075][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:22:12,394][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:22:13,008][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:22:13,326][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:22:13,645][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:22:13,965][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:22:14,284][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:22:14,602][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:22:14,922][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:22:15,240][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:22:15,558][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:22:15,876][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:22:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:22:16,514][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:22:16,833][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:22:17,492][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:22:18,235][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:22:18,237][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:22:18,239][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:22:18,894][__main__][INFO] - Iteration 360 took 27s (11.93% Gen, 85.65% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 45m 36s. Estimated total time: 7h 32m 51s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 25s. [2026-03-25 18:22:18,897][__main__][INFO] - Starting iteration 360. [2026-03-25 18:22:18,899][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:22:18,900][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:22:22,058][__main__][INFO] - Number of regex retries in iteration 360: 0 [2026-03-25 18:22:22,059][__main__][INFO] - agents played in iteration 360 are Alice, Bob [2026-03-25 18:22:22,589][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:22:23,241][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:22:23,531][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:22:23,851][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:22:24,170][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:22:24,488][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:22:24,807][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:22:25,125][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:22:25,444][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:22:25,762][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:22:26,081][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:22:26,399][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:22:26,717][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:22:27,035][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:22:27,354][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:22:27,674][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:22:27,993][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:22:28,312][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:22:28,631][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:22:28,950][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:22:29,269][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:22:29,588][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:22:29,906][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:22:30,226][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:22:30,545][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:22:30,864][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:22:31,183][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:22:31,501][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:22:31,820][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:22:32,139][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:22:32,458][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:22:32,777][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:22:33,097][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:22:33,416][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:22:33,736][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:22:34,055][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:22:34,373][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:22:34,692][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:22:35,010][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:22:35,329][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:22:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:22:35,966][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:22:36,284][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:22:36,602][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:22:36,921][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:22:37,240][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:22:37,560][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:22:37,878][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:22:38,197][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:22:38,515][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:22:38,833][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:22:39,153][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:22:39,474][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:22:40,091][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:22:40,410][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:22:40,729][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:22:41,049][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:22:41,368][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:22:41,687][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:22:42,008][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:22:42,328][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:22:42,648][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:22:42,968][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:22:43,288][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:22:43,607][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:22:43,926][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:22:44,592][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:22:45,337][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:22:45,339][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:22:45,340][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:22:45,988][__main__][INFO] - Iteration 361 took 27s (11.66% Gen, 85.94% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 43m 48s. Estimated total time: 7h 31m 30s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 9s, 500 more iterations: 3h 45m 45s. [2026-03-25 18:22:45,991][__main__][INFO] - Starting iteration 361. [2026-03-25 18:22:45,994][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:22:45,994][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:22:49,165][__main__][INFO] - Number of regex retries in iteration 361: 0 [2026-03-25 18:22:49,166][__main__][INFO] - agents played in iteration 361 are Alice, Bob [2026-03-25 18:22:49,694][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:22:50,346][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:22:50,635][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:22:50,954][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:22:51,274][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:22:51,593][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:22:51,912][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:22:52,231][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:22:52,550][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:22:52,869][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:22:53,187][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:22:53,507][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:22:53,826][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:22:54,144][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:22:54,463][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:22:54,781][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:22:55,099][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:22:55,418][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:22:55,737][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:22:56,055][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:22:56,374][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:22:56,692][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:22:57,011][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:22:57,331][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:22:57,649][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:22:57,969][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:22:58,287][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:22:58,606][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:22:58,926][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:22:59,244][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:22:59,563][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:22:59,881][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:23:00,200][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:23:00,518][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:23:00,836][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:23:01,155][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:23:01,474][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:23:01,792][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:23:02,112][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:23:02,430][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:23:02,748][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:23:03,066][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:23:03,385][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:23:03,705][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:23:04,024][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:23:04,342][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:23:04,661][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:23:04,980][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:23:05,300][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:23:05,620][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:23:05,939][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:23:06,258][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:23:06,576][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:23:07,189][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:23:07,508][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:23:07,827][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:23:08,146][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:23:08,465][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:23:08,783][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:23:09,103][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:23:09,421][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:23:09,740][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:23:10,059][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:23:10,379][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:23:10,698][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:23:11,016][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:23:11,679][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:23:12,415][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:23:12,417][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:23:12,419][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:23:13,068][__main__][INFO] - Iteration 362 took 27s (11.71% Gen, 85.88% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 43m 6s. Estimated total time: 7h 31m 14s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 7s, 500 more iterations: 3h 45m 37s. [2026-03-25 18:23:13,070][__main__][INFO] - Starting iteration 362. [2026-03-25 18:23:13,073][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:23:13,073][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:23:16,235][__main__][INFO] - Number of regex retries in iteration 362: 0 [2026-03-25 18:23:16,236][__main__][INFO] - agents played in iteration 362 are Alice, Bob [2026-03-25 18:23:16,760][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:23:17,420][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:23:17,710][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:23:18,029][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:23:18,347][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:23:18,666][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:23:18,985][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:23:19,303][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:23:19,622][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:23:19,940][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:23:20,259][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:23:20,577][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:23:20,895][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:23:21,214][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:23:21,533][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:23:21,853][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:23:22,172][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:23:22,490][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:23:22,809][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:23:23,128][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:23:23,446][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:23:23,765][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:23:24,084][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:23:24,404][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:23:24,724][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:23:25,043][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:23:25,362][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:23:25,680][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:23:25,998][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:23:26,316][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:23:26,635][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:23:26,954][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:23:27,273][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:23:27,593][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:23:27,911][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:23:28,230][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:23:28,549][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:23:28,868][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:23:29,187][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:23:29,505][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:23:29,823][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:23:30,141][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:23:30,460][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:23:30,778][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:23:31,096][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:23:31,415][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:23:31,733][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:23:32,052][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:23:32,370][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:23:32,689][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:23:33,007][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:23:33,325][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:23:33,644][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:23:34,260][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:23:34,579][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:23:34,898][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:23:35,218][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:23:35,536][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:23:35,855][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:23:36,174][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:23:36,494][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:23:36,812][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:23:37,131][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:23:37,450][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:23:37,769][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:23:38,087][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:23:38,750][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:23:39,484][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:23:39,486][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:23:39,488][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:23:40,136][__main__][INFO] - Iteration 363 took 27s (11.69% Gen, 85.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 42m 28s. Estimated total time: 7h 31m 3s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 6s, 500 more iterations: 3h 45m 31s. [2026-03-25 18:23:40,138][__main__][INFO] - Starting iteration 363. [2026-03-25 18:23:40,141][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:23:40,141][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:23:43,266][__main__][INFO] - Number of regex retries in iteration 363: 0 [2026-03-25 18:23:43,267][__main__][INFO] - agents played in iteration 363 are Alice, Bob [2026-03-25 18:23:43,791][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:23:44,450][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:23:44,740][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:23:45,059][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:23:45,377][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:23:45,696][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:23:46,014][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:23:46,332][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:23:46,650][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:23:46,969][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:23:47,287][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:23:47,605][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:23:47,923][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:23:48,241][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:23:48,561][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:23:48,880][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:23:49,198][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:23:49,518][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:23:49,837][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:23:50,157][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:23:50,476][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:23:50,796][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:23:51,116][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:23:51,435][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:23:51,755][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:23:52,074][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:23:52,394][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:23:52,713][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:23:53,031][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:23:53,350][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:23:53,669][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:23:53,989][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:23:54,309][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:23:54,628][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:23:54,947][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:23:55,267][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:23:55,586][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:23:55,905][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:23:56,226][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:23:56,546][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:23:56,866][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:23:57,186][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:23:57,506][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:23:57,825][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:23:58,145][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:23:58,465][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:23:58,785][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:23:59,104][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:23:59,423][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:23:59,741][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:24:00,061][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:24:00,381][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:24:00,700][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:24:01,317][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:24:01,636][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:24:01,955][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:24:02,274][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:24:02,592][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:24:02,910][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:24:03,229][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:24:03,547][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:24:03,866][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:24:04,185][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:24:04,503][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:24:04,822][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:24:05,140][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:24:05,804][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:24:06,537][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:24:06,539][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:24:06,541][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:24:07,235][__main__][INFO] - Iteration 364 took 27s (11.53% Gen, 85.90% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 42m 32s. Estimated total time: 7h 31m 35s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 9s, 500 more iterations: 3h 45m 47s. [2026-03-25 18:24:07,237][__main__][INFO] - Starting iteration 364. [2026-03-25 18:24:07,240][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:24:07,241][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:24:10,408][__main__][INFO] - Number of regex retries in iteration 364: 0 [2026-03-25 18:24:10,409][__main__][INFO] - agents played in iteration 364 are Alice, Bob [2026-03-25 18:24:10,934][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:24:11,585][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:24:11,875][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:24:12,195][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:24:12,513][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:24:12,832][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:24:13,150][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:24:13,470][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:24:13,789][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:24:14,107][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:24:14,426][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:24:14,744][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:24:15,062][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:24:15,380][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:24:15,698][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:24:16,017][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:24:16,337][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:24:16,657][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:24:16,975][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:24:17,294][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:24:17,613][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:24:17,931][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:24:18,250][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:24:18,568][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:24:18,887][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:24:19,205][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:24:19,523][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:24:19,841][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:24:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:24:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:24:20,796][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:24:21,115][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:24:21,434][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:24:21,753][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:24:22,072][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:24:22,391][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:24:22,711][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:24:23,029][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:24:23,348][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:24:23,667][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:24:23,986][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:24:24,305][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:24:24,623][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:24:24,941][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:24:25,259][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:24:25,578][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:24:25,897][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:24:26,216][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:24:26,535][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:24:26,853][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:24:27,172][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:24:27,490][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:24:27,808][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:24:28,421][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:24:28,739][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:24:29,058][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:24:29,377][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:24:29,695][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:24:30,014][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:24:30,333][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:24:30,653][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:24:30,971][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:24:31,290][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:24:31,608][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:24:31,927][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:24:32,245][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:24:32,902][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:24:33,634][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:24:33,636][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:24:33,637][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:24:34,283][__main__][INFO] - Iteration 365 took 27s (11.71% Gen, 85.89% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 41m 13s. Estimated total time: 7h 30m 43s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 4s, 500 more iterations: 3h 45m 21s. [2026-03-25 18:24:34,285][__main__][INFO] - Starting iteration 365. [2026-03-25 18:24:34,288][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:24:34,288][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:24:37,462][__main__][INFO] - Number of regex retries in iteration 365: 0 [2026-03-25 18:24:37,463][__main__][INFO] - agents played in iteration 365 are Alice, Bob [2026-03-25 18:24:37,997][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:24:38,649][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:24:38,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:24:39,258][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:24:39,576][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:24:39,895][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:24:40,213][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:24:40,532][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:24:40,851][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:24:41,169][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:24:41,488][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:24:41,807][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:24:42,125][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:24:42,443][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:24:42,761][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:24:43,080][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:24:43,399][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:24:43,718][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:24:44,036][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:24:44,354][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:24:44,673][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:24:44,992][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:24:45,311][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:24:45,629][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:24:45,948][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:24:46,267][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:24:46,585][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:24:46,903][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:24:47,221][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:24:47,539][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:24:47,858][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:24:48,176][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:24:48,494][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:24:48,812][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:24:49,131][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:24:49,451][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:24:49,769][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:24:50,087][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:24:50,405][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:24:50,725][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:24:51,044][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:24:51,362][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:24:51,680][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:24:51,998][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:24:52,317][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:24:52,636][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:24:52,954][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:24:53,272][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:24:53,591][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:24:53,910][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:24:54,230][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:24:54,549][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:24:54,868][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:24:55,480][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:24:55,798][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:24:56,116][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:24:56,434][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:24:56,752][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:24:57,070][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:24:57,388][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:24:57,706][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:24:58,025][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:24:58,343][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:24:58,661][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:24:58,979][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:24:59,298][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:24:59,955][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:25:00,684][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:25:00,686][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:25:00,688][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:25:01,335][__main__][INFO] - Iteration 366 took 27s (11.74% Gen, 85.87% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 40m 51s. Estimated total time: 7h 30m 48s. Time estimates for 10 more iterations: 4m 30s, 100 more iterations: 45m 4s, 500 more iterations: 3h 45m 24s. [2026-03-25 18:25:01,337][__main__][INFO] - Starting iteration 366. [2026-03-25 18:25:01,340][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:25:01,340][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:25:04,516][__main__][INFO] - Number of regex retries in iteration 366: 0 [2026-03-25 18:25:04,517][__main__][INFO] - agents played in iteration 366 are Alice, Bob [2026-03-25 18:25:05,058][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:25:05,713][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:25:06,002][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:25:06,322][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:25:06,642][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:25:06,961][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:25:07,281][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:25:07,601][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:25:07,920][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:25:08,240][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:25:08,561][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:25:08,881][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:25:09,200][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:25:09,519][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:25:09,839][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:25:10,159][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:25:10,479][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:25:10,798][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:25:11,117][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:25:11,438][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:25:11,757][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:25:12,077][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:25:12,396][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:25:12,715][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:25:13,034][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:25:13,353][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:25:13,673][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:25:13,994][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:25:14,314][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:25:14,632][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:25:14,951][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:25:15,270][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:25:15,588][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:25:15,906][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:25:16,224][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:25:16,542][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:25:16,861][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:25:17,180][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:25:17,499][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:25:17,817][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:25:18,135][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:25:18,455][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:25:18,774][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:25:19,093][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:25:19,412][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:25:19,732][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:25:20,051][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:25:20,370][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:25:20,688][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:25:21,006][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:25:21,324][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:25:21,642][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:25:21,962][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:25:22,579][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:25:22,897][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:25:23,217][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:25:23,536][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:25:23,855][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:25:24,173][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:25:24,492][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:25:24,810][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:25:25,129][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:25:25,447][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:25:25,766][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:25:26,084][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:25:26,402][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:25:27,074][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:25:27,818][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:25:27,820][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:25:27,822][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:25:28,467][__main__][INFO] - Iteration 367 took 27s (11.71% Gen, 85.91% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 41m 43s. Estimated total time: 7h 32m 8s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 12s, 500 more iterations: 3h 46m 4s. [2026-03-25 18:25:28,469][__main__][INFO] - Starting iteration 367. [2026-03-25 18:25:28,472][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:25:28,472][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:25:31,641][__main__][INFO] - Number of regex retries in iteration 367: 0 [2026-03-25 18:25:31,641][__main__][INFO] - agents played in iteration 367 are Alice, Bob [2026-03-25 18:25:32,166][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:25:32,826][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:25:33,114][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:25:33,434][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:25:33,753][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:25:34,071][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:25:34,389][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:25:34,708][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:25:35,027][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:25:35,345][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:25:35,663][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:25:35,982][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:25:36,300][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:25:36,619][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:25:36,938][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:25:37,257][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:25:37,575][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:25:37,894][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:25:38,213][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:25:38,532][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:25:38,851][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:25:39,170][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:25:39,488][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:25:39,806][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:25:40,125][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:25:40,443][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:25:40,763][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:25:41,082][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:25:41,400][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:25:41,718][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:25:42,038][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:25:42,356][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:25:42,675][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:25:42,994][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:25:43,314][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:25:43,634][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:25:43,952][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:25:44,271][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:25:44,590][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:25:44,909][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:25:45,228][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:25:45,547][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:25:45,865][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:25:46,185][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:25:46,503][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:25:46,823][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:25:47,142][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:25:47,461][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:25:47,780][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:25:48,098][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:25:48,417][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:25:48,736][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:25:49,055][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:25:49,672][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:25:49,991][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:25:50,310][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:25:50,628][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:25:50,946][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:25:51,265][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:25:51,583][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:25:51,901][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:25:52,220][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:25:52,538][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:25:52,856][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:25:53,174][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:25:53,493][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:25:54,156][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:25:54,894][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:25:54,896][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:25:54,898][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:25:55,596][__main__][INFO] - Iteration 368 took 27s (11.68% Gen, 85.74% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 41m 14s. Estimated total time: 7h 32m 5s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 12s, 500 more iterations: 3h 46m 2s. [2026-03-25 18:25:55,599][__main__][INFO] - Starting iteration 368. [2026-03-25 18:25:55,602][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:25:55,602][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:25:58,778][__main__][INFO] - Number of regex retries in iteration 368: 0 [2026-03-25 18:25:58,779][__main__][INFO] - agents played in iteration 368 are Alice, Bob [2026-03-25 18:25:59,305][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:25:59,965][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:26:00,255][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:26:00,575][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:26:00,894][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:26:01,213][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:26:01,531][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:26:01,849][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:26:02,168][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:26:02,486][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:26:02,805][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:26:03,123][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:26:03,443][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:26:03,761][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:26:04,080][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:26:04,398][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:26:04,718][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:26:05,036][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:26:05,356][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:26:05,677][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:26:05,996][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:26:06,315][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:26:06,635][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:26:06,954][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:26:07,274][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:26:07,592][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:26:07,911][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:26:08,230][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:26:08,549][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:26:08,867][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:26:09,186][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:26:09,504][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:26:09,823][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:26:10,141][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:26:10,459][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:26:10,777][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:26:11,096][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:26:11,414][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:26:11,734][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:26:12,053][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:26:12,372][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:26:12,691][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:26:13,010][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:26:13,328][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:26:13,646][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:26:13,964][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:26:14,283][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:26:14,601][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:26:14,920][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:26:15,239][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:26:15,558][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:26:15,877][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:26:16,195][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:26:16,812][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:26:17,131][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:26:17,450][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:26:17,768][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:26:18,087][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:26:18,406][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:26:18,725][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:26:19,044][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:26:19,364][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:26:19,683][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:26:20,002][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:26:20,321][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:26:20,640][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:26:21,314][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:26:22,052][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:26:22,054][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:26:22,055][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:26:22,705][__main__][INFO] - Iteration 369 took 27s (11.72% Gen, 85.88% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 40m 26s. Estimated total time: 7h 31m 44s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 10s, 500 more iterations: 3h 45m 52s. [2026-03-25 18:26:22,709][__main__][INFO] - Starting iteration 369. [2026-03-25 18:26:22,712][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:26:22,712][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:26:25,928][__main__][INFO] - Number of regex retries in iteration 369: 0 [2026-03-25 18:26:25,929][__main__][INFO] - agents played in iteration 369 are Alice, Bob [2026-03-25 18:26:26,468][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:26:27,129][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:26:27,418][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:26:27,737][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:26:28,057][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:26:28,377][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:26:28,697][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:26:29,016][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:26:29,335][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:26:29,654][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:26:29,973][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:26:30,292][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:26:30,610][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:26:30,928][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:26:31,246][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:26:31,565][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:26:31,883][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:26:32,203][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:26:32,521][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:26:32,841][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:26:33,160][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:26:33,479][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:26:33,798][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:26:34,116][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:26:34,435][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:26:34,752][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:26:35,071][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:26:35,390][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:26:35,709][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:26:36,027][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:26:36,345][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:26:36,663][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:26:36,982][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:26:37,300][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:26:37,619][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:26:37,937][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:26:38,257][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:26:38,575][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:26:38,893][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:26:39,211][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:26:39,530][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:26:39,849][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:26:40,167][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:26:40,486][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:26:40,804][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:26:41,122][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:26:41,440][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:26:41,758][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:26:42,076][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:26:42,395][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:26:42,714][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:26:43,033][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:26:43,351][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:26:43,967][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:26:44,286][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:26:44,604][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:26:44,922][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:26:45,240][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:26:45,559][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:26:45,877][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:26:46,196][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:26:46,516][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:26:46,835][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:26:47,153][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:26:47,473][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:26:47,791][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:26:48,455][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:26:49,197][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:26:49,199][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:26:49,201][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:26:49,850][__main__][INFO] - Iteration 370 took 27s (11.85% Gen, 85.75% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 40m 33s. Estimated total time: 7h 32m 19s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 9s. [2026-03-25 18:26:49,852][__main__][INFO] - Starting iteration 370. [2026-03-25 18:26:49,856][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:26:49,856][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:26:53,061][__main__][INFO] - Number of regex retries in iteration 370: 0 [2026-03-25 18:26:53,062][__main__][INFO] - agents played in iteration 370 are Alice, Bob [2026-03-25 18:26:53,593][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:26:54,253][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:26:54,541][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:26:54,860][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:26:55,178][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:26:55,496][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:26:55,815][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:26:56,133][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:26:56,453][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:26:56,771][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:26:57,090][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:26:57,409][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:26:57,728][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:26:58,046][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:26:58,364][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:26:58,682][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:26:59,001][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:26:59,320][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:26:59,640][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:26:59,958][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:27:00,277][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:27:00,595][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:27:00,915][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:27:01,234][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:27:01,554][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:27:01,874][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:27:02,193][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:27:02,512][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:27:02,830][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:27:03,149][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:27:03,468][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:27:03,787][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:27:04,105][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:27:04,424][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:27:04,743][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:27:05,061][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:27:05,381][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:27:05,699][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:27:06,018][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:27:06,337][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:27:06,656][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:27:06,974][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:27:07,294][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:27:07,613][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:27:07,932][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:27:08,251][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:27:08,570][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:27:08,888][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:27:09,207][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:27:09,526][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:27:09,845][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:27:10,163][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:27:10,483][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:27:11,098][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:27:11,418][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:27:11,737][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:27:12,055][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:27:12,373][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:27:12,692][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:27:13,012][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:27:13,330][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:27:13,649][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:27:13,969][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:27:14,288][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:27:14,606][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:27:14,925][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:27:15,589][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:27:16,331][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:27:16,333][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:27:16,335][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:27:17,091][__main__][INFO] - Iteration 371 took 27s (11.77% Gen, 85.44% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 41m 44s. Estimated total time: 7h 33m 56s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 23s, 500 more iterations: 3h 46m 58s. [2026-03-25 18:27:17,093][__main__][INFO] - Starting iteration 371. [2026-03-25 18:27:17,096][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:27:17,096][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:27:20,298][__main__][INFO] - Number of regex retries in iteration 371: 0 [2026-03-25 18:27:20,299][__main__][INFO] - agents played in iteration 371 are Alice, Bob [2026-03-25 18:27:20,831][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:27:21,491][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:27:21,779][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:27:22,099][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:27:22,416][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:27:22,735][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:27:23,054][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:27:23,373][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:27:23,693][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:27:24,012][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:27:24,330][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:27:24,649][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:27:24,968][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:27:25,286][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:27:25,606][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:27:25,925][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:27:26,243][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:27:26,561][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:27:26,880][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:27:27,199][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:27:27,517][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:27:27,835][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:27:28,154][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:27:28,473][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:27:28,793][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:27:29,112][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:27:29,431][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:27:29,749][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:27:30,068][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:27:30,387][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:27:30,706][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:27:31,025][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:27:31,344][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:27:31,662][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:27:31,982][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:27:32,301][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:27:32,619][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:27:32,939][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:27:33,258][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:27:33,579][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:27:33,898][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:27:34,217][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:27:34,536][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:27:34,854][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:27:35,173][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:27:35,493][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:27:35,814][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:27:36,133][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:27:36,452][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:27:36,771][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:27:37,090][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:27:37,409][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:27:37,728][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:27:38,349][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:27:38,668][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:27:38,988][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:27:39,307][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:27:39,626][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:27:39,945][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:27:40,264][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:27:40,583][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:27:40,902][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:27:41,221][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:27:41,540][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:27:41,859][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:27:42,178][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:27:42,859][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:27:43,593][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:27:43,595][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:27:43,597][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:27:44,238][__main__][INFO] - Iteration 372 took 27s (11.80% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 39m 44s. Estimated total time: 7h 32m 23s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 14s, 500 more iterations: 3h 46m 11s. [2026-03-25 18:27:44,241][__main__][INFO] - Starting iteration 372. [2026-03-25 18:27:44,244][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:27:44,244][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:27:47,432][__main__][INFO] - Number of regex retries in iteration 372: 0 [2026-03-25 18:27:47,433][__main__][INFO] - agents played in iteration 372 are Alice, Bob [2026-03-25 18:27:47,964][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:27:48,624][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:27:48,914][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:27:49,233][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:27:49,551][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:27:49,870][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:27:50,188][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:27:50,507][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:27:50,825][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:27:51,143][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:27:51,462][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:27:51,781][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:27:52,100][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:27:52,419][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:27:52,738][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:27:53,057][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:27:53,376][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:27:53,696][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:27:54,013][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:27:54,332][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:27:54,651][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:27:54,970][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:27:55,290][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:27:55,609][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:27:55,927][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:27:56,245][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:27:56,563][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:27:56,882][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:27:57,201][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:27:57,520][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:27:57,839][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:27:58,159][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:27:58,478][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:27:58,797][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:27:59,116][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:27:59,436][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:27:59,756][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:28:00,075][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:28:00,394][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:28:00,712][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:28:01,031][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:28:01,351][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:28:01,671][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:28:01,990][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:28:02,308][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:28:02,627][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:28:02,945][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:28:03,263][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:28:03,582][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:28:03,901][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:28:04,220][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:28:04,538][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:28:04,857][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:28:05,474][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:28:05,794][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:28:06,114][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:28:06,433][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:28:06,752][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:28:07,071][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:28:07,391][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:28:07,710][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:28:08,030][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:28:08,349][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:28:08,668][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:28:08,986][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:28:09,304][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:28:09,968][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:28:10,708][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:28:10,710][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:28:10,712][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:28:11,428][__main__][INFO] - Iteration 373 took 27s (11.73% Gen, 85.63% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 39m 58s. Estimated total time: 7h 33m 5s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 18s, 500 more iterations: 3h 46m 32s. [2026-03-25 18:28:11,431][__main__][INFO] - Starting iteration 373. [2026-03-25 18:28:11,433][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:28:11,434][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:28:14,627][__main__][INFO] - Number of regex retries in iteration 373: 0 [2026-03-25 18:28:14,628][__main__][INFO] - agents played in iteration 373 are Alice, Bob [2026-03-25 18:28:15,157][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:28:15,815][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:28:16,104][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:28:16,423][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:28:16,742][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:28:17,060][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:28:17,379][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:28:17,697][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:28:18,016][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:28:18,335][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:28:18,653][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:28:18,972][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:28:19,290][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:28:19,609][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:28:19,928][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:28:20,248][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:28:20,566][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:28:20,885][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:28:21,204][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:28:21,523][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:28:21,843][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:28:22,162][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:28:22,482][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:28:22,801][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:28:23,121][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:28:23,440][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:28:23,759][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:28:24,078][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:28:24,396][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:28:24,715][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:28:25,033][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:28:25,352][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:28:25,671][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:28:25,989][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:28:26,308][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:28:26,627][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:28:26,945][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:28:27,265][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:28:27,583][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:28:27,902][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:28:28,221][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:28:28,540][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:28:28,860][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:28:29,178][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:28:29,496][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:28:29,815][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:28:30,133][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:28:30,451][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:28:30,769][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:28:31,089][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:28:31,409][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:28:31,728][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:28:32,046][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:28:32,662][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:28:32,980][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:28:33,299][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:28:33,618][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:28:33,937][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:28:34,256][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:28:34,575][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:28:34,894][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:28:35,214][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:28:35,533][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:28:35,852][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:28:36,171][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:28:36,491][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:28:37,152][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:28:37,918][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:28:37,920][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:28:37,922][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:28:38,535][__main__][INFO] - Iteration 374 took 27s (11.78% Gen, 85.95% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 38m 8s. Estimated total time: 7h 31m 42s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 10s, 500 more iterations: 3h 45m 51s. [2026-03-25 18:28:38,537][__main__][INFO] - Starting iteration 374. [2026-03-25 18:28:38,540][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:28:38,541][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:28:41,743][__main__][INFO] - Number of regex retries in iteration 374: 0 [2026-03-25 18:28:41,744][__main__][INFO] - agents played in iteration 374 are Alice, Bob [2026-03-25 18:28:42,273][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:28:42,931][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:28:43,219][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:28:43,538][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:28:43,856][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:28:44,173][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:28:44,492][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:28:44,811][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:28:45,129][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:28:45,447][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:28:45,766][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:28:46,085][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:28:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:28:46,722][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:28:47,042][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:28:47,361][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:28:47,679][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:28:47,998][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:28:48,318][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:28:48,638][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:28:48,957][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:28:49,277][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:28:49,596][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:28:49,915][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:28:50,235][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:28:50,556][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:28:50,876][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:28:51,196][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:28:51,516][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:28:51,835][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:28:52,155][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:28:52,475][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:28:52,795][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:28:53,115][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:28:53,435][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:28:53,754][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:28:54,073][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:28:54,392][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:28:54,712][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:28:55,032][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:28:55,352][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:28:55,671][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:28:55,990][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:28:56,309][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:28:56,629][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:28:56,948][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:28:57,267][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:28:57,586][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:28:57,905][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:28:58,223][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:28:58,541][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:28:58,859][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:28:59,179][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:28:59,794][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:29:00,114][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:29:00,433][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:29:00,752][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:29:01,072][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:29:01,391][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:29:01,709][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:29:02,028][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:29:02,346][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:29:02,664][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:29:02,984][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:29:03,302][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:29:03,621][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:29:04,282][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:29:05,031][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:29:05,033][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:29:05,035][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:29:05,675][__main__][INFO] - Iteration 375 took 27s (11.80% Gen, 85.83% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 38m 14s. Estimated total time: 7h 32m 15s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 7s. [2026-03-25 18:29:05,677][__main__][INFO] - Starting iteration 375. [2026-03-25 18:29:05,680][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:29:05,681][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:29:08,874][__main__][INFO] - Number of regex retries in iteration 375: 0 [2026-03-25 18:29:08,875][__main__][INFO] - agents played in iteration 375 are Alice, Bob [2026-03-25 18:29:09,400][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:29:10,058][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:29:10,348][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:29:10,666][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:29:10,984][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:29:11,302][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:29:11,621][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:29:11,940][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:29:12,259][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:29:12,578][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:29:12,898][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:29:13,216][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:29:13,536][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:29:13,855][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:29:14,175][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:29:14,495][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:29:14,815][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:29:15,135][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:29:15,454][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:29:15,773][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:29:16,091][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:29:16,409][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:29:16,728][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:29:17,047][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:29:17,365][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:29:17,684][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:29:18,004][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:29:18,322][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:29:18,640][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:29:18,959][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:29:19,280][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:29:19,599][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:29:19,918][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:29:20,237][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:29:20,556][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:29:20,875][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:29:21,194][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:29:21,513][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:29:21,832][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:29:22,150][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:29:22,468][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:29:22,788][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:29:23,106][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:29:23,424][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:29:23,743][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:29:24,062][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:29:24,380][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:29:24,699][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:29:25,018][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:29:25,337][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:29:25,656][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:29:25,975][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:29:26,294][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:29:26,908][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:29:27,227][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:29:27,545][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:29:27,864][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:29:28,184][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:29:28,502][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:29:28,821][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:29:29,140][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:29:29,459][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:29:29,779][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:29:30,098][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:29:30,417][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:29:30,735][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:29:31,397][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:29:32,141][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:29:32,143][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:29:32,145][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:29:32,913][__main__][INFO] - Iteration 376 took 27s (11.73% Gen, 85.45% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 39m 25s. Estimated total time: 7h 33m 53s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 23s, 500 more iterations: 3h 46m 56s. [2026-03-25 18:29:32,915][__main__][INFO] - Starting iteration 376. [2026-03-25 18:29:32,918][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:29:32,918][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:29:36,101][__main__][INFO] - Number of regex retries in iteration 376: 0 [2026-03-25 18:29:36,102][__main__][INFO] - agents played in iteration 376 are Alice, Bob [2026-03-25 18:29:36,636][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:29:37,297][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:29:37,586][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:29:37,905][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:29:38,223][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:29:38,542][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:29:38,860][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:29:39,179][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:29:39,497][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:29:39,816][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:29:40,135][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:29:40,455][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:29:40,774][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:29:41,093][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:29:41,413][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:29:41,731][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:29:42,049][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:29:42,368][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:29:42,687][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:29:43,006][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:29:43,326][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:29:43,645][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:29:43,965][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:29:44,284][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:29:44,603][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:29:44,922][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:29:45,240][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:29:45,558][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:29:45,877][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:29:46,196][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:29:46,515][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:29:46,835][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:29:47,155][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:29:47,475][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:29:47,794][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:29:48,113][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:29:48,432][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:29:48,753][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:29:49,073][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:29:49,393][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:29:49,713][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:29:50,031][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:29:50,349][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:29:50,667][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:29:50,986][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:29:51,305][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:29:51,623][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:29:51,941][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:29:52,260][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:29:52,579][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:29:52,898][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:29:53,217][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:29:53,536][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:29:54,153][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:29:54,472][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:29:54,792][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:29:55,112][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:29:55,431][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:29:55,750][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:29:56,069][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:29:56,387][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:29:56,705][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:29:57,025][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:29:57,344][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:29:57,663][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:29:57,981][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:29:58,646][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:29:59,387][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:29:59,389][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:29:59,391][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:30:00,034][__main__][INFO] - Iteration 377 took 27s (11.74% Gen, 85.88% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 37m 1s. Estimated total time: 7h 31m 56s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 11s, 500 more iterations: 3h 45m 58s. [2026-03-25 18:30:00,036][__main__][INFO] - Starting iteration 377. [2026-03-25 18:30:00,039][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:30:00,039][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:30:03,293][__main__][INFO] - Number of regex retries in iteration 377: 0 [2026-03-25 18:30:03,294][__main__][INFO] - agents played in iteration 377 are Alice, Bob [2026-03-25 18:30:03,843][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:30:04,512][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:30:04,802][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:30:05,123][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:30:05,442][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:30:05,762][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:30:06,080][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:30:06,401][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:30:06,720][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:30:07,039][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:30:07,359][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:30:07,679][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:30:08,000][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:30:08,319][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:30:08,638][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:30:08,958][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:30:09,277][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:30:09,597][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:30:09,916][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:30:10,236][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:30:10,555][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:30:10,876][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:30:11,197][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:30:11,516][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:30:11,836][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:30:12,155][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:30:12,475][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:30:12,795][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:30:13,114][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:30:13,433][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:30:13,752][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:30:14,071][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:30:14,390][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:30:14,709][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:30:15,027][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:30:15,345][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:30:15,664][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:30:15,983][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:30:16,301][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:30:16,620][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:30:16,939][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:30:17,258][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:30:17,577][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:30:17,896][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:30:18,216][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:30:18,535][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:30:18,854][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:30:19,173][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:30:19,492][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:30:19,811][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:30:20,130][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:30:20,449][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:30:20,768][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:30:21,386][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:30:21,704][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:30:22,022][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:30:22,341][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:30:22,659][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:30:22,979][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:30:23,299][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:30:23,617][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:30:23,936][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:30:24,256][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:30:24,575][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:30:24,896][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:30:25,214][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:30:25,879][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:30:26,624][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:30:26,626][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:30:26,628][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:30:27,323][__main__][INFO] - Iteration 378 took 27s (11.93% Gen, 85.52% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 39m 22s. Estimated total time: 7h 34m 45s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 28s, 500 more iterations: 3h 47m 22s. [2026-03-25 18:30:27,325][__main__][INFO] - Starting iteration 378. [2026-03-25 18:30:27,328][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:30:27,328][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:30:30,544][__main__][INFO] - Number of regex retries in iteration 378: 0 [2026-03-25 18:30:30,545][__main__][INFO] - agents played in iteration 378 are Alice, Bob [2026-03-25 18:30:31,084][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:30:31,756][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:30:32,045][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:30:32,364][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:30:32,684][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:30:33,002][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:30:33,321][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:30:33,640][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:30:33,959][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:30:34,278][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:30:34,597][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:30:34,916][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:30:35,235][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:30:35,556][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:30:35,875][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:30:36,194][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:30:36,514][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:30:36,834][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:30:37,153][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:30:37,472][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:30:37,791][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:30:38,110][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:30:38,428][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:30:38,747][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:30:39,065][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:30:39,383][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:30:39,702][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:30:40,021][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:30:40,340][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:30:40,659][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:30:40,978][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:30:41,298][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:30:41,617][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:30:41,937][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:30:42,256][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:30:42,576][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:30:42,896][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:30:43,214][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:30:43,534][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:30:43,853][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:30:44,172][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:30:44,491][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:30:44,810][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:30:45,129][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:30:45,447][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:30:45,765][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:30:46,085][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:30:46,404][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:30:46,723][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:30:47,042][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:30:47,361][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:30:47,679][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:30:47,999][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:30:48,621][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:30:48,940][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:30:49,260][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:30:49,579][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:30:49,897][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:30:50,216][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:30:50,535][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:30:50,855][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:30:51,174][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:30:51,493][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:30:51,813][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:30:52,133][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:30:52,452][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:30:53,127][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:30:53,873][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:30:53,875][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:30:53,876][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:30:54,602][__main__][INFO] - Iteration 379 took 27s (11.79% Gen, 85.54% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 38m 44s. Estimated total time: 7h 34m 34s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 27s, 500 more iterations: 3h 47m 17s. [2026-03-25 18:30:54,604][__main__][INFO] - Starting iteration 379. [2026-03-25 18:30:54,607][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:30:54,607][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:30:57,823][__main__][INFO] - Number of regex retries in iteration 379: 0 [2026-03-25 18:30:57,824][__main__][INFO] - agents played in iteration 379 are Alice, Bob [2026-03-25 18:30:58,361][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:30:59,023][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:30:59,312][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:30:59,632][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:30:59,950][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:31:00,268][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:31:00,586][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:31:00,905][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:31:01,223][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:31:01,542][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:31:01,860][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:31:02,179][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:31:02,499][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:31:02,819][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:31:03,137][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:31:03,456][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:31:03,775][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:31:04,094][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:31:04,414][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:31:04,733][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:31:05,053][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:31:05,373][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:31:05,692][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:31:06,010][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:31:06,330][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:31:06,650][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:31:06,969][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:31:07,288][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:31:07,607][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:31:07,925][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:31:08,243][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:31:08,561][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:31:08,879][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:31:09,198][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:31:09,517][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:31:09,835][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:31:10,154][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:31:10,473][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:31:10,793][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:31:11,112][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:31:11,431][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:31:11,751][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:31:12,069][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:31:12,390][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:31:12,709][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:31:13,027][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:31:13,346][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:31:13,665][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:31:13,983][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:31:14,301][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:31:14,619][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:31:14,938][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:31:15,257][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:31:15,875][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:31:16,194][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:31:16,513][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:31:16,832][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:31:17,152][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:31:17,471][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:31:17,791][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:31:18,110][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:31:18,429][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:31:18,750][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:31:19,070][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:31:19,389][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:31:19,709][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:31:20,385][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:31:21,136][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:31:21,138][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:31:21,141][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:31:21,856][__main__][INFO] - Iteration 380 took 27s (11.81% Gen, 85.57% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 37m 52s. Estimated total time: 7h 34m 10s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 25s, 500 more iterations: 3h 47m 5s. [2026-03-25 18:31:21,858][__main__][INFO] - Starting iteration 380. [2026-03-25 18:31:21,861][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:31:21,861][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:31:25,104][__main__][INFO] - Number of regex retries in iteration 380: 0 [2026-03-25 18:31:25,104][__main__][INFO] - agents played in iteration 380 are Alice, Bob [2026-03-25 18:31:25,646][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:31:26,308][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:31:26,598][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:31:26,917][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:31:27,237][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:31:27,556][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:31:27,875][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:31:28,195][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:31:28,514][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:31:28,834][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:31:29,153][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:31:29,471][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:31:29,789][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:31:30,108][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:31:30,427][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:31:30,745][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:31:31,064][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:31:31,383][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:31:31,702][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:31:32,021][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:31:32,340][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:31:32,659][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:31:32,978][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:31:33,297][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:31:33,615][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:31:33,935][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:31:34,254][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:31:34,573][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:31:34,892][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:31:35,210][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:31:35,528][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:31:35,847][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:31:36,166][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:31:36,485][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:31:36,804][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:31:37,123][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:31:37,443][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:31:37,761][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:31:38,079][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:31:38,398][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:31:38,717][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:31:39,035][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:31:39,354][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:31:39,672][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:31:39,991][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:31:40,310][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:31:40,629][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:31:40,948][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:31:41,267][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:31:41,585][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:31:41,905][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:31:42,223][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:31:42,542][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:31:43,159][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:31:43,477][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:31:43,797][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:31:44,117][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:31:44,435][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:31:44,754][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:31:45,073][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:31:45,391][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:31:45,710][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:31:46,030][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:31:46,349][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:31:46,668][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:31:46,987][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:31:47,651][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:31:48,390][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:31:48,392][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:31:48,394][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:31:49,037][__main__][INFO] - Iteration 381 took 27s (11.93% Gen, 85.69% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 36m 12s. Estimated total time: 7h 32m 57s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 17s, 500 more iterations: 3h 46m 28s. [2026-03-25 18:31:49,039][__main__][INFO] - Starting iteration 381. [2026-03-25 18:31:49,042][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:31:49,043][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:31:49,520][mllm.models.large_language_model_local][WARNING] - Response ` Assassination` did not match regex: (|), retry 1/1 [2026-03-25 18:31:50,966][mllm.models.large_language_model_local][WARNING] - Response user Last round, the other agent played . did not match regex: (|), retry 1/1 [2026-03-25 18:31:52,380][__main__][INFO] - Number of regex retries in iteration 381: 2 [2026-03-25 18:31:52,381][__main__][INFO] - agents played in iteration 381 are Alice, Bob [2026-03-25 18:31:52,916][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:31:53,568][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:31:53,857][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:31:54,177][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:31:54,496][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:31:54,814][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:31:55,132][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:31:55,451][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:31:55,770][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:31:56,089][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:31:56,407][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:31:56,726][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:31:57,045][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:31:57,364][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:31:57,682][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:31:58,001][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:31:58,320][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:31:58,639][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:31:58,958][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:31:59,277][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:31:59,596][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:31:59,915][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:32:00,235][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:32:00,554][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:32:00,873][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:32:01,192][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:32:01,511][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:32:01,831][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:32:02,149][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:32:02,468][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:32:02,787][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:32:03,106][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:32:03,425][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:32:03,744][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:32:04,063][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:32:04,382][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:32:04,701][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:32:05,020][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:32:05,339][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:32:05,659][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:32:05,979][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:32:06,299][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:32:06,616][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:32:06,935][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:32:07,253][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:32:07,571][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:32:07,890][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:32:08,209][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:32:08,527][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:32:08,846][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:32:09,164][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:32:09,483][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:32:09,801][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:32:10,413][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:32:10,732][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:32:11,051][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:32:11,370][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:32:11,689][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:32:12,008][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:32:12,328][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:32:12,647][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:32:12,966][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:32:13,284][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:32:13,603][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:32:13,921][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:32:14,241][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:32:14,897][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:32:15,637][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:32:15,639][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:32:15,641][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:32:16,262][__main__][INFO] - Iteration 382 took 27s (12.26% Gen, 85.45% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 36m 28s. Estimated total time: 7h 33m 40s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 22s, 500 more iterations: 3h 46m 50s. [2026-03-25 18:32:16,264][__main__][INFO] - Starting iteration 382. [2026-03-25 18:32:16,267][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:32:16,267][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:32:19,482][__main__][INFO] - Number of regex retries in iteration 382: 0 [2026-03-25 18:32:19,483][__main__][INFO] - agents played in iteration 382 are Alice, Bob [2026-03-25 18:32:20,021][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:32:20,674][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:32:20,962][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:32:21,283][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:32:21,601][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:32:21,919][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:32:22,237][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:32:22,556][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:32:22,875][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:32:23,193][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:32:23,511][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:32:23,829][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:32:24,148][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:32:24,467][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:32:24,786][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:32:25,105][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:32:25,424][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:32:25,742][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:32:26,060][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:32:26,379][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:32:26,698][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:32:27,017][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:32:27,336][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:32:27,656][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:32:27,976][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:32:28,295][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:32:28,613][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:32:28,933][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:32:29,252][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:32:29,572][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:32:29,892][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:32:30,212][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:32:30,532][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:32:30,851][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:32:31,171][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:32:31,490][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:32:31,810][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:32:32,130][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:32:32,450][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:32:32,770][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:32:33,090][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:32:33,410][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:32:33,730][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:32:34,050][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:32:34,370][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:32:34,689][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:32:35,008][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:32:35,327][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:32:35,647][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:32:35,967][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:32:36,287][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:32:36,607][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:32:36,927][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:32:37,545][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:32:37,865][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:32:38,185][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:32:38,504][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:32:38,824][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:32:39,144][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:32:39,465][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:32:39,785][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:32:40,103][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:32:40,421][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:32:40,741][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:32:41,059][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:32:41,379][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:32:42,035][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:32:42,814][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:32:42,817][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:32:42,820][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:32:43,517][__main__][INFO] - Iteration 383 took 27s (11.80% Gen, 85.64% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 36m 32s. Estimated total time: 7h 34m 11s. Time estimates for 10 more iterations: 4m 32s, 100 more iterations: 45m 25s, 500 more iterations: 3h 47m 5s. [2026-03-25 18:32:43,519][__main__][INFO] - Starting iteration 383. [2026-03-25 18:32:43,522][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:32:43,523][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:32:46,754][__main__][INFO] - Number of regex retries in iteration 383: 0 [2026-03-25 18:32:46,755][__main__][INFO] - agents played in iteration 383 are Alice, Bob [2026-03-25 18:32:47,289][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:32:47,942][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:32:48,231][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:32:48,551][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:32:48,869][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:32:49,188][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:32:49,506][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:32:49,824][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:32:50,143][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:32:50,462][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:32:50,781][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:32:51,100][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:32:51,420][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:32:51,739][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:32:52,057][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:32:52,377][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:32:52,695][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:32:53,014][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:32:53,333][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:32:53,651][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:32:53,971][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:32:54,290][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:32:54,608][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:32:54,927][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:32:55,246][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:32:55,564][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:32:55,882][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:32:56,202][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:32:56,521][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:32:56,839][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:32:57,158][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:32:57,477][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:32:57,796][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:32:58,116][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:32:58,435][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:32:58,755][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:32:59,075][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:32:59,394][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:32:59,713][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:33:00,032][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:33:00,352][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:33:00,671][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:33:00,990][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:33:01,309][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:33:01,628][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:33:01,946][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:33:02,265][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:33:02,584][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:33:02,904][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:33:03,223][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:33:03,542][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:33:03,861][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:33:04,180][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:33:04,793][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:33:05,112][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:33:05,430][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:33:05,749][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:33:06,068][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:33:06,386][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:33:06,705][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:33:07,024][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:33:07,342][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:33:07,660][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:33:07,980][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:33:08,299][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:33:08,618][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:33:09,275][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:33:10,009][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:33:10,011][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:33:10,014][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:33:10,656][__main__][INFO] - Iteration 384 took 27s (11.91% Gen, 85.72% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 34m 8s. Estimated total time: 7h 32m 15s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 7s. [2026-03-25 18:33:10,659][__main__][INFO] - Starting iteration 384. [2026-03-25 18:33:10,663][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:33:10,664][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:33:13,888][__main__][INFO] - Number of regex retries in iteration 384: 0 [2026-03-25 18:33:13,889][__main__][INFO] - agents played in iteration 384 are Alice, Bob [2026-03-25 18:33:14,427][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:33:15,079][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:33:15,368][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:33:15,688][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:33:16,007][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:33:16,326][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:33:16,644][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:33:16,962][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:33:17,281][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:33:17,600][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:33:17,918][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:33:18,237][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:33:18,556][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:33:18,876][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:33:19,195][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:33:19,515][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:33:19,834][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:33:20,153][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:33:20,473][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:33:20,791][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:33:21,110][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:33:21,428][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:33:21,748][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:33:22,067][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:33:22,387][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:33:22,705][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:33:23,025][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:33:23,343][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:33:23,661][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:33:23,980][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:33:24,298][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:33:24,618][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:33:24,937][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:33:25,256][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:33:25,575][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:33:25,894][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:33:26,213][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:33:26,532][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:33:26,850][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:33:27,170][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:33:27,489][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:33:27,808][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:33:28,127][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:33:28,446][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:33:28,765][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:33:29,084][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:33:29,402][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:33:29,722][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:33:30,041][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:33:30,360][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:33:30,679][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:33:30,997][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:33:31,316][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:33:31,929][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:33:32,247][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:33:32,566][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:33:32,885][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:33:33,204][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:33:33,522][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:33:33,841][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:33:34,159][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:33:34,478][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:33:34,796][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:33:35,116][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:33:35,436][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:33:35,756][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:33:36,411][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:33:37,142][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:33:37,144][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:33:37,145][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:33:37,780][__main__][INFO] - Iteration 385 took 27s (11.89% Gen, 85.76% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 33m 25s. Estimated total time: 7h 31m 59s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 11s, 500 more iterations: 3h 45m 59s. [2026-03-25 18:33:37,782][__main__][INFO] - Starting iteration 385. [2026-03-25 18:33:37,785][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:33:37,785][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:33:40,980][__main__][INFO] - Number of regex retries in iteration 385: 0 [2026-03-25 18:33:40,981][__main__][INFO] - agents played in iteration 385 are Alice, Bob [2026-03-25 18:33:41,518][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:33:42,172][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:33:42,462][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:33:42,781][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:33:43,100][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:33:43,420][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:33:43,739][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:33:44,057][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:33:44,377][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:33:44,698][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:33:45,017][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:33:45,337][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:33:45,658][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:33:45,977][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:33:46,297][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:33:46,617][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:33:46,937][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:33:47,256][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:33:47,576][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:33:47,896][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:33:48,216][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:33:48,536][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:33:48,856][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:33:49,175][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:33:49,494][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:33:49,813][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:33:50,134][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:33:50,454][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:33:50,775][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:33:51,095][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:33:51,415][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:33:51,735][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:33:52,054][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:33:52,374][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:33:52,694][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:33:53,014][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:33:53,334][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:33:53,653][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:33:53,972][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:33:54,290][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:33:54,608][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:33:54,927][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:33:55,246][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:33:55,564][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:33:55,883][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:33:56,201][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:33:56,520][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:33:56,838][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:33:57,158][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:33:57,477][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:33:57,796][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:33:58,115][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:33:58,434][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:33:59,046][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:33:59,364][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:33:59,684][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:34:00,002][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:34:00,321][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:34:00,641][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:34:00,960][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:34:01,280][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:34:01,598][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:34:01,917][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:34:02,237][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:34:02,555][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:34:02,874][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:34:03,531][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:34:04,290][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:34:04,292][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:34:04,293][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:34:04,936][__main__][INFO] - Iteration 386 took 27s (11.77% Gen, 85.86% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 33m 32s. Estimated total time: 7h 32m 32s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 15s, 500 more iterations: 3h 46m 16s. [2026-03-25 18:34:04,939][__main__][INFO] - Starting iteration 386. [2026-03-25 18:34:04,942][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:34:04,942][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:34:08,166][__main__][INFO] - Number of regex retries in iteration 386: 0 [2026-03-25 18:34:08,166][__main__][INFO] - agents played in iteration 386 are Alice, Bob [2026-03-25 18:34:08,695][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:34:09,349][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:34:09,636][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:34:09,955][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:34:10,273][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:34:10,592][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:34:10,911][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:34:11,230][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:34:11,549][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:34:11,868][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:34:12,186][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:34:12,505][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:34:12,823][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:34:13,142][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:34:13,460][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:34:13,779][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:34:14,097][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:34:14,417][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:34:14,737][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:34:15,056][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:34:15,376][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:34:15,696][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:34:16,015][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:34:16,334][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:34:16,653][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:34:16,971][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:34:17,289][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:34:17,608][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:34:17,926][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:34:18,245][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:34:18,563][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:34:18,883][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:34:19,202][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:34:19,521][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:34:19,840][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:34:20,159][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:34:20,478][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:34:20,797][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:34:21,115][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:34:21,434][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:34:21,752][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:34:22,070][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:34:22,389][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:34:22,708][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:34:23,027][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:34:23,346][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:34:23,665][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:34:23,984][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:34:24,303][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:34:24,622][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:34:24,941][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:34:25,260][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:34:25,579][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:34:26,191][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:34:26,511][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:34:26,829][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:34:27,149][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:34:27,468][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:34:27,788][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:34:28,107][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:34:28,425][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:34:28,744][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:34:29,062][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:34:29,382][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:34:29,701][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:34:30,020][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:34:30,676][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:34:31,425][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:34:31,427][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:34:31,429][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:34:32,072][__main__][INFO] - Iteration 387 took 27s (11.88% Gen, 85.74% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 32m 43s. Estimated total time: 7h 32m 11s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 13s, 500 more iterations: 3h 46m 5s. [2026-03-25 18:34:32,074][__main__][INFO] - Starting iteration 387. [2026-03-25 18:34:32,077][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:34:32,078][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:34:35,272][__main__][INFO] - Number of regex retries in iteration 387: 0 [2026-03-25 18:34:35,273][__main__][INFO] - agents played in iteration 387 are Alice, Bob [2026-03-25 18:34:35,803][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:34:36,454][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:34:36,743][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:34:37,063][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:34:37,381][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:34:37,700][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256 [2026-03-25 18:34:38,019][mllm.training.trainer_common][INFO] - Processing mini-batch 20 of 256 [2026-03-25 18:34:38,338][mllm.training.trainer_common][INFO] - Processing mini-batch 24 of 256 [2026-03-25 18:34:38,658][mllm.training.trainer_common][INFO] - Processing mini-batch 28 of 256 [2026-03-25 18:34:38,977][mllm.training.trainer_common][INFO] - Processing mini-batch 32 of 256 [2026-03-25 18:34:39,296][mllm.training.trainer_common][INFO] - Processing mini-batch 36 of 256 [2026-03-25 18:34:39,615][mllm.training.trainer_common][INFO] - Processing mini-batch 40 of 256 [2026-03-25 18:34:39,933][mllm.training.trainer_common][INFO] - Processing mini-batch 44 of 256 [2026-03-25 18:34:40,251][mllm.training.trainer_common][INFO] - Processing mini-batch 48 of 256 [2026-03-25 18:34:40,570][mllm.training.trainer_common][INFO] - Processing mini-batch 52 of 256 [2026-03-25 18:34:40,889][mllm.training.trainer_common][INFO] - Processing mini-batch 56 of 256 [2026-03-25 18:34:41,207][mllm.training.trainer_common][INFO] - Processing mini-batch 60 of 256 [2026-03-25 18:34:41,526][mllm.training.trainer_common][INFO] - Processing mini-batch 64 of 256 [2026-03-25 18:34:41,845][mllm.training.trainer_common][INFO] - Processing mini-batch 68 of 256 [2026-03-25 18:34:42,165][mllm.training.trainer_common][INFO] - Processing mini-batch 72 of 256 [2026-03-25 18:34:42,484][mllm.training.trainer_common][INFO] - Processing mini-batch 76 of 256 [2026-03-25 18:34:42,802][mllm.training.trainer_common][INFO] - Processing mini-batch 80 of 256 [2026-03-25 18:34:43,120][mllm.training.trainer_common][INFO] - Processing mini-batch 84 of 256 [2026-03-25 18:34:43,438][mllm.training.trainer_common][INFO] - Processing mini-batch 88 of 256 [2026-03-25 18:34:43,757][mllm.training.trainer_common][INFO] - Processing mini-batch 92 of 256 [2026-03-25 18:34:44,077][mllm.training.trainer_common][INFO] - Processing mini-batch 96 of 256 [2026-03-25 18:34:44,396][mllm.training.trainer_common][INFO] - Processing mini-batch 100 of 256 [2026-03-25 18:34:44,715][mllm.training.trainer_common][INFO] - Processing mini-batch 104 of 256 [2026-03-25 18:34:45,034][mllm.training.trainer_common][INFO] - Processing mini-batch 108 of 256 [2026-03-25 18:34:45,353][mllm.training.trainer_common][INFO] - Processing mini-batch 112 of 256 [2026-03-25 18:34:45,673][mllm.training.trainer_common][INFO] - Processing mini-batch 116 of 256 [2026-03-25 18:34:45,992][mllm.training.trainer_common][INFO] - Processing mini-batch 120 of 256 [2026-03-25 18:34:46,311][mllm.training.trainer_common][INFO] - Processing mini-batch 124 of 256 [2026-03-25 18:34:46,630][mllm.training.trainer_common][INFO] - Processing mini-batch 128 of 256 [2026-03-25 18:34:46,949][mllm.training.trainer_common][INFO] - Processing mini-batch 132 of 256 [2026-03-25 18:34:47,268][mllm.training.trainer_common][INFO] - Processing mini-batch 136 of 256 [2026-03-25 18:34:47,586][mllm.training.trainer_common][INFO] - Processing mini-batch 140 of 256 [2026-03-25 18:34:47,905][mllm.training.trainer_common][INFO] - Processing mini-batch 144 of 256 [2026-03-25 18:34:48,224][mllm.training.trainer_common][INFO] - Processing mini-batch 148 of 256 [2026-03-25 18:34:48,542][mllm.training.trainer_common][INFO] - Processing mini-batch 152 of 256 [2026-03-25 18:34:48,860][mllm.training.trainer_common][INFO] - Processing mini-batch 156 of 256 [2026-03-25 18:34:49,178][mllm.training.trainer_common][INFO] - Processing mini-batch 160 of 256 [2026-03-25 18:34:49,497][mllm.training.trainer_common][INFO] - Processing mini-batch 164 of 256 [2026-03-25 18:34:49,817][mllm.training.trainer_common][INFO] - Processing mini-batch 168 of 256 [2026-03-25 18:34:50,136][mllm.training.trainer_common][INFO] - Processing mini-batch 172 of 256 [2026-03-25 18:34:50,455][mllm.training.trainer_common][INFO] - Processing mini-batch 176 of 256 [2026-03-25 18:34:50,773][mllm.training.trainer_common][INFO] - Processing mini-batch 180 of 256 [2026-03-25 18:34:51,092][mllm.training.trainer_common][INFO] - Processing mini-batch 184 of 256 [2026-03-25 18:34:51,412][mllm.training.trainer_common][INFO] - Processing mini-batch 188 of 256 [2026-03-25 18:34:51,732][mllm.training.trainer_common][INFO] - Processing mini-batch 192 of 256 [2026-03-25 18:34:52,051][mllm.training.trainer_common][INFO] - Processing mini-batch 196 of 256 [2026-03-25 18:34:52,369][mllm.training.trainer_common][INFO] - Processing mini-batch 200 of 256 [2026-03-25 18:34:52,688][mllm.training.trainer_common][INFO] - Processing mini-batch 204 of 256 [2026-03-25 18:34:53,300][mllm.training.trainer_common][INFO] - Processing mini-batch 208 of 256 [2026-03-25 18:34:53,618][mllm.training.trainer_common][INFO] - Processing mini-batch 212 of 256 [2026-03-25 18:34:53,937][mllm.training.trainer_common][INFO] - Processing mini-batch 216 of 256 [2026-03-25 18:34:54,256][mllm.training.trainer_common][INFO] - Processing mini-batch 220 of 256 [2026-03-25 18:34:54,576][mllm.training.trainer_common][INFO] - Processing mini-batch 224 of 256 [2026-03-25 18:34:54,894][mllm.training.trainer_common][INFO] - Processing mini-batch 228 of 256 [2026-03-25 18:34:55,214][mllm.training.trainer_common][INFO] - Processing mini-batch 232 of 256 [2026-03-25 18:34:55,534][mllm.training.trainer_common][INFO] - Processing mini-batch 236 of 256 [2026-03-25 18:34:55,853][mllm.training.trainer_common][INFO] - Processing mini-batch 240 of 256 [2026-03-25 18:34:56,171][mllm.training.trainer_common][INFO] - Processing mini-batch 244 of 256 [2026-03-25 18:34:56,491][mllm.training.trainer_common][INFO] - Processing mini-batch 248 of 256 [2026-03-25 18:34:56,809][mllm.training.trainer_common][INFO] - Processing mini-batch 252 of 256 [2026-03-25 18:34:57,129][mllm.training.trainer_common][INFO] - Accumulated the policy gradient loss for 7680 tokens. [2026-03-25 18:34:57,784][mllm.training.trainer_common][INFO] - For task: Apply reinforce step, ΔVRAM % (total): 6.26%, Current % of VRAM taken: 45.79%, Block Peak % of device VRAM: 26.60%, ΔTime: 00:00:21 [2026-03-25 18:34:58,519][mllm.training.trainer_common][INFO] - Saved main optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/policy_optimizer_state.pt [2026-03-25 18:34:58,521][mllm.training.trainer_common][INFO] - Saved critic optimizer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/critic_optimizer_state.pt [2026-03-25 18:34:58,523][mllm.training.trainer_common][INFO] - Saved trainer state to /scratch/muqeeth/llm_negotiation/2026_03/ipd_naive_seed1337/seed_1337/agent_trainer/trainer_annealing_state.pkl [2026-03-25 18:34:59,180][__main__][INFO] - Iteration 388 took 27s (11.79% Gen, 85.78% Train). Generation: 3s, Training: 23s. Estimated remaining time: 4h 31m 49s. Estimated total time: 7h 31m 43s. Time estimates for 10 more iterations: 4m 31s, 100 more iterations: 45m 10s, 500 more iterations: 3h 45m 51s. [2026-03-25 18:34:59,182][__main__][INFO] - Starting iteration 388. [2026-03-25 18:34:59,185][__main__][INFO] - Inference policies count is regular policies 2 and buffer policies 7 and human policies 1. [2026-03-25 18:34:59,186][__main__][INFO] - Hard coded buffer agents are set to False with prob 0 [2026-03-25 18:35:02,450][__main__][INFO] - Number of regex retries in iteration 388: 0 [2026-03-25 18:35:02,451][__main__][INFO] - agents played in iteration 388 are Alice, Bob [2026-03-25 18:35:02,987][mllm.training.trainer_independent][INFO] - Sharing advantage data. [2026-03-25 18:35:03,649][mllm.training.trainer_common][INFO] - Processing mini-batch 0 of 256 [2026-03-25 18:35:03,938][mllm.training.trainer_common][INFO] - Processing mini-batch 4 of 256 [2026-03-25 18:35:04,258][mllm.training.trainer_common][INFO] - Processing mini-batch 8 of 256 [2026-03-25 18:35:04,577][mllm.training.trainer_common][INFO] - Processing mini-batch 12 of 256 [2026-03-25 18:35:04,897][mllm.training.trainer_common][INFO] - Processing mini-batch 16 of 256